========================
START TIME: Sat Jul 6 09:31:48 UTC 2024
python3 version = Python 3.10.14
========================
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token
Login successful
fatal: Unable to create '/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/.git/index.lock': File exists.
Another git process seems to be running in this repository, e.g. an editor opened by 'git commit'. Please make sure all processes are terminated then try again. If it still fails, a git process may have crashed in this repository earlier: remove the file manually to continue.
Job status: RUNNING
[2024-07-06 09:31:51,331] torch.distributed.run: [WARNING]
[2024-07-06 09:31:51,331] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:31:51,331] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:31:51,331] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:31:51,347] torch.distributed.run: [WARNING]
[2024-07-06 09:31:51,347] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:31:51,347] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:31:51,347] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:31:51,348] torch.distributed.run: [WARNING]
[2024-07-06 09:31:51,348] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:31:51,348] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:31:51,348] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:31:51,359] torch.distributed.run: [WARNING]
[2024-07-06 09:31:51,359] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:31:51,359] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:31:51,359] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:31:51,370] torch.distributed.run: [WARNING]
[2024-07-06 09:31:51,370] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:31:51,370] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:31:51,370] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:31:51,382] torch.distributed.run: [WARNING]
[2024-07-06 09:31:51,382] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:31:51,382] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:31:51,382] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:31:56,377] torch.distributed.run: [WARNING]
[2024-07-06 09:31:56,377] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:31:56,377] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:31:56,377] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:31:57,066] torch.distributed.run: [WARNING]
[2024-07-06 09:31:57,066] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:31:57,066] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:31:57,066] torch.distributed.run: [WARNING] ***************************************** [default0]:07/06/2024 09:32:16 [WARNING|DP=0|PP=0|TP=0|ip-26-0-163-236]: [Vocab Size Padding] Padded vocab (size: 50257) with 1 dummy tokens (new size: 50258) [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: Config: [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: Config(general=GeneralArgs(project='bench_cluster', [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: run='%date_%jobid', [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: seed=42, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: step=None, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: consumed_train_samples=None, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: benchmark_csv_path=None, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: ignore_sanity_checks=True), [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: parallelism=ParallelismArgs(dp=1, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: pp=32, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: tp=2, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: pp_engine=, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: tp_mode=, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: tp_linear_async_communication=False, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: expert_parallel_size=1), [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: eos_token_id=2, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: hidden_act='silu', [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: 
hidden_size=2048, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: initializer_range=0.02, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: intermediate_size=4096, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: is_llama_config=True, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: max_position_embeddings=4096, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: num_attention_heads=32, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: num_hidden_layers=24, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: num_key_value_heads=32, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: pad_token_id=None, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: pretraining_tp=1, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: rms_norm_eps=1e-05, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: rope_scaling=None, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: rope_theta=10000.0, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: tie_word_embeddings=True, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: use_cache=True, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: vocab_size=50258), [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: init_method=RandomInit(std=0.025), [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: dtype=torch.bfloat16, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: make_vocab_size_divisible_by=1, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: ddp_bucket_cap_mb=25), [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2', [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: 
tokenizer_revision=None, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: tokenizer_max_length=None), [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: checkpoints=CheckpointsArgs(checkpoints_path=PosixPath('/dev/null'), [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: checkpoint_interval=100000, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: save_initial_state=False, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: resume_checkpoint_path=None, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: checkpoints_path_is_shared_file_system=False), [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: logging=LoggingArgs(log_level='info', [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: log_level_replica='info', [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: iteration_step_info_interval=1), [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: tokens=TokensArgs(sequence_length=4096, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: train_steps=20, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: micro_batch_size=8, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: batch_accumulation_per_replica=128, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: val_check_interval=-1, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: limit_val_batches=0, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: limit_test_batches=0), [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: adam_beta1=0.9, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: adam_beta2=0.95, [default0]:07/06/2024 09:32:16 
[INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: torch_adam_is_fused=True, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: name='adamW'), [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: zero_stage=1, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: weight_decay=0.01, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: clip_grad=1.0, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: accumulate_grad_in_fp32=True, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: lr_warmup_steps=1, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: lr_warmup_style='linear', [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: lr_decay_style='linear', [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: lr_decay_steps=19, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: lr_decay_starting_step=None, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: min_decay_lr=1e-05)), [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: data_stages=[DatasetStageArgs(name='Training Stage', [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: start_training_step=1, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories', [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: hf_dataset_splits='train', [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: hf_dataset_config_name=None, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: dataset_processing_num_proc_per_process=64, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: dataset_overwrite_cache=False, 
[default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: text_column_name='text'), [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: seed=42, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: num_loading_workers=0))], [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: profiler=ProfilerArgs(profiler_export_path=PosixPath('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/64_GPUS/dp-1_tp-2_pp-32_mbz-8')), [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: lighteval=None) [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: Model Config: [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: LlamaConfig(bos_token_id=1, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: eos_token_id=2, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: hidden_act='silu', [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: hidden_size=2048, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: initializer_range=0.02, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: intermediate_size=4096, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: is_llama_config=True, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: max_position_embeddings=4096, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: num_attention_heads=32, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: num_hidden_layers=24, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: num_key_value_heads=32, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: pad_token_id=None, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: pretraining_tp=1, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: rms_norm_eps=1e-05, [default0]:07/06/2024 09:32:16 
[INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: rope_scaling=None, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: rope_theta=10000.0, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: tie_word_embeddings=True, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: use_cache=True, [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: vocab_size=50258) [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: Building model.. [default0]:07/06/2024 09:32:16 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: Setting PP block ranks... [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=14|TP=1|ip-26-0-164-45]: Local number of parameters: 21M (40.01MiB) [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=14|TP=1|ip-26-0-164-45]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=14|TP=1|ip-26-0-164-45]: No checkpoint path provided. [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=20|TP=1|ip-26-0-168-120]: Local number of parameters: 21M (40.01MiB) [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=20|TP=1|ip-26-0-168-120]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=20|TP=1|ip-26-0-168-120]: No checkpoint path provided. [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=22|TP=1|ip-26-0-168-120]: Local number of parameters: 21M (40.01MiB) [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=22|TP=1|ip-26-0-168-120]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=22|TP=1|ip-26-0-168-120]: No checkpoint path provided. [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=15|TP=0|ip-26-0-164-45]: Local number of parameters: 21M (40.01MiB) [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=15|TP=0|ip-26-0-164-45]: [After model building] Memory usage: 41.02MiB. 
Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=21|TP=1|ip-26-0-168-120]: Local number of parameters: 21M (40.01MiB) [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=21|TP=1|ip-26-0-168-120]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=21|TP=1|ip-26-0-168-120]: No checkpoint path provided. [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=15|TP=0|ip-26-0-164-45]: No checkpoint path provided. [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=23|TP=0|ip-26-0-168-120]: Local number of parameters: 21M (40.01MiB) [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=23|TP=0|ip-26-0-168-120]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=23|TP=0|ip-26-0-168-120]: No checkpoint path provided. [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=22|TP=0|ip-26-0-168-120]: Local number of parameters: 21M (40.01MiB) [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=22|TP=0|ip-26-0-168-120]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=22|TP=0|ip-26-0-168-120]: No checkpoint path provided. [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=11|TP=0|ip-26-0-164-187]: Local number of parameters: 21M (40.01MiB) [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=11|TP=0|ip-26-0-164-187]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=11|TP=0|ip-26-0-164-187]: No checkpoint path provided. [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=20|TP=0|ip-26-0-168-120]: Local number of parameters: 21M (40.01MiB) [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=20|TP=0|ip-26-0-168-120]: [After model building] Memory usage: 41.02MiB. 
Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=8|TP=1|ip-26-0-164-187]: Local number of parameters: 21M (40.01MiB) [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=8|TP=1|ip-26-0-164-187]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=8|TP=1|ip-26-0-164-187]: No checkpoint path provided. [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=20|TP=0|ip-26-0-168-120]: No checkpoint path provided. [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=17|TP=0|ip-26-0-164-75]: Local number of parameters: 21M (40.01MiB) [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=17|TP=0|ip-26-0-164-75]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=17|TP=0|ip-26-0-164-75]: No checkpoint path provided. [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=16|TP=0|ip-26-0-164-75]: Local number of parameters: 21M (40.01MiB) [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=16|TP=0|ip-26-0-164-75]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=16|TP=0|ip-26-0-164-75]: No checkpoint path provided. [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=18|TP=0|ip-26-0-164-75]: Local number of parameters: 21M (40.01MiB) [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=18|TP=0|ip-26-0-164-75]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=18|TP=0|ip-26-0-164-75]: No checkpoint path provided. [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=17|TP=1|ip-26-0-164-75]: Local number of parameters: 21M (40.01MiB) [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=17|TP=1|ip-26-0-164-75]: [After model building] Memory usage: 41.02MiB. 
Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=8|TP=0|ip-26-0-164-187]: Local number of parameters: 21M (40.01MiB) [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=8|TP=0|ip-26-0-164-187]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=8|TP=0|ip-26-0-164-187]: No checkpoint path provided. [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=6|TP=0|ip-26-0-164-18]: Local number of parameters: 21M (40.01MiB) [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=6|TP=0|ip-26-0-164-18]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=6|TP=0|ip-26-0-164-18]: No checkpoint path provided. [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=17|TP=1|ip-26-0-164-75]: No checkpoint path provided. [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=4|TP=0|ip-26-0-164-18]: Local number of parameters: 21M (40.01MiB) [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=4|TP=0|ip-26-0-164-18]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=4|TP=0|ip-26-0-164-18]: No checkpoint path provided. [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=19|TP=1|ip-26-0-164-75]: Local number of parameters: 21M (40.01MiB) [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=19|TP=1|ip-26-0-164-75]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=3|TP=0|ip-26-0-163-236]: Local number of parameters: 21M (40.01MiB) [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=3|TP=0|ip-26-0-163-236]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=3|TP=0|ip-26-0-163-236]: No checkpoint path provided. 
[default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=21|TP=0|ip-26-0-168-120]: Local number of parameters: 21M (40.01MiB) [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=6|TP=1|ip-26-0-164-18]: Local number of parameters: 21M (40.01MiB) [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=6|TP=1|ip-26-0-164-18]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=6|TP=1|ip-26-0-164-18]: No checkpoint path provided. [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=19|TP=1|ip-26-0-164-75]: No checkpoint path provided. [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: Total number of parameters: 1.21G (2313.02MiB) [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: Local number of parameters: 72.4M (138.17MiB) [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=31|TP=1|ip-26-0-173-7]: Local number of parameters: 0 (0.00MiB) [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=31|TP=1|ip-26-0-173-7]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=21|TP=0|ip-26-0-168-120]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=23|TP=1|ip-26-0-168-120]: Local number of parameters: 21M (40.01MiB) [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=5|TP=1|ip-26-0-164-18]: Local number of parameters: 21M (40.01MiB) [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=5|TP=1|ip-26-0-164-18]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=5|TP=1|ip-26-0-164-18]: No checkpoint path provided. [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=19|TP=0|ip-26-0-164-75]: Local number of parameters: 21M (40.01MiB) [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: [After model building] Memory usage: 139.18MiB. 
Peak allocated: 141.21MiB Peak reserved: 156.00MiB [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=31|TP=0|ip-26-0-173-7]: Local number of parameters: 0 (0.00MiB) [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=31|TP=0|ip-26-0-173-7]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=28|TP=1|ip-26-0-173-7]: Local number of parameters: 0 (0.00MiB) [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=28|TP=1|ip-26-0-173-7]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=21|TP=0|ip-26-0-168-120]: No checkpoint path provided. [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=23|TP=1|ip-26-0-168-120]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=23|TP=1|ip-26-0-168-120]: No checkpoint path provided. [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=5|TP=0|ip-26-0-164-18]: Local number of parameters: 21M (40.01MiB) [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=5|TP=0|ip-26-0-164-18]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=5|TP=0|ip-26-0-164-18]: No checkpoint path provided. [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=19|TP=0|ip-26-0-164-75]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=19|TP=0|ip-26-0-164-75]: No checkpoint path provided. [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: No checkpoint path provided. [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: Parametrizing model parameters using StandardParametrizator [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=28|TP=1|ip-26-0-173-7]: No checkpoint path provided. 
[default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=31|TP=1|ip-26-0-173-7]: No checkpoint path provided. [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=4|TP=1|ip-26-0-164-18]: Local number of parameters: 21M (40.01MiB) [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=4|TP=1|ip-26-0-164-18]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=4|TP=1|ip-26-0-164-18]: No checkpoint path provided. [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=26|TP=0|ip-26-0-172-116]: Local number of parameters: 0 (0.00MiB) [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=27|TP=1|ip-26-0-172-116]: Local number of parameters: 0 (0.00MiB) [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=26|TP=0|ip-26-0-172-116]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=26|TP=0|ip-26-0-172-116]: No checkpoint path provided. [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=27|TP=1|ip-26-0-172-116]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=16|TP=1|ip-26-0-164-75]: Local number of parameters: 21M (40.01MiB) [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=2|TP=0|ip-26-0-163-236]: Local number of parameters: 21M (40.01MiB) [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=31|TP=0|ip-26-0-173-7]: No checkpoint path provided. [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=10|TP=1|ip-26-0-164-187]: Local number of parameters: 21M (40.01MiB) [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=10|TP=1|ip-26-0-164-187]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=10|TP=1|ip-26-0-164-187]: No checkpoint path provided. 
[default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=7|TP=0|ip-26-0-164-18]: Local number of parameters: 21M (40.01MiB) [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=7|TP=0|ip-26-0-164-18]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=27|TP=1|ip-26-0-172-116]: No checkpoint path provided. [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=16|TP=1|ip-26-0-164-75]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=16|TP=1|ip-26-0-164-75]: No checkpoint path provided. [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=2|TP=0|ip-26-0-163-236]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=2|TP=0|ip-26-0-163-236]: No checkpoint path provided. [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=28|TP=0|ip-26-0-173-7]: Local number of parameters: 0 (0.00MiB) [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=28|TP=0|ip-26-0-173-7]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=28|TP=0|ip-26-0-173-7]: No checkpoint path provided. [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=7|TP=0|ip-26-0-164-18]: No checkpoint path provided. [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=26|TP=1|ip-26-0-172-116]: Local number of parameters: 0 (0.00MiB) [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=18|TP=1|ip-26-0-164-75]: Local number of parameters: 21M (40.01MiB) [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=18|TP=1|ip-26-0-164-75]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=3|TP=1|ip-26-0-163-236]: Local number of parameters: 21M (40.01MiB) [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=26|TP=1|ip-26-0-172-116]: [After model building] Memory usage: 0.01MiB. 
Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=26|TP=1|ip-26-0-172-116]: No checkpoint path provided. [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=18|TP=1|ip-26-0-164-75]: No checkpoint path provided. [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=3|TP=1|ip-26-0-163-236]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=25|TP=0|ip-26-0-172-116]: Local number of parameters: 51.5M (98.16MiB) [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=25|TP=0|ip-26-0-172-116]: [After model building] Memory usage: 98.17MiB. Peak allocated: 98.19MiB Peak reserved: 102.00MiB [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=25|TP=0|ip-26-0-172-116]: No checkpoint path provided. [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=3|TP=1|ip-26-0-163-236]: No checkpoint path provided. [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=1|TP=0|ip-26-0-163-236]: Local number of parameters: 21M (40.01MiB) [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=9|TP=1|ip-26-0-164-187]: Local number of parameters: 21M (40.01MiB) [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=25|TP=1|ip-26-0-172-116]: Local number of parameters: 51.5M (98.16MiB) [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=25|TP=1|ip-26-0-172-116]: [After model building] Memory usage: 98.17MiB. Peak allocated: 98.19MiB Peak reserved: 102.00MiB [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=1|TP=0|ip-26-0-163-236]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=1|TP=0|ip-26-0-163-236]: No checkpoint path provided. [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=29|TP=0|ip-26-0-173-7]: Local number of parameters: 0 (0.00MiB) [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=9|TP=1|ip-26-0-164-187]: [After model building] Memory usage: 41.02MiB. 
Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=25|TP=1|ip-26-0-172-116]: No checkpoint path provided. [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=2|TP=1|ip-26-0-163-236]: Local number of parameters: 21M (40.01MiB) [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=30|TP=0|ip-26-0-173-7]: Local number of parameters: 0 (0.00MiB) [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=11|TP=1|ip-26-0-164-187]: Local number of parameters: 21M (40.01MiB) [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=24|TP=0|ip-26-0-172-116]: Local number of parameters: 2.05K (0.00MiB) [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=2|TP=1|ip-26-0-163-236]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=30|TP=0|ip-26-0-173-7]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=11|TP=1|ip-26-0-164-187]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=11|TP=1|ip-26-0-164-187]: No checkpoint path provided. [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=24|TP=0|ip-26-0-172-116]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=24|TP=0|ip-26-0-172-116]: No checkpoint path provided. [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=0|TP=1|ip-26-0-163-236]: Local number of parameters: 72.4M (138.17MiB) [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=30|TP=0|ip-26-0-173-7]: No checkpoint path provided. [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=30|TP=1|ip-26-0-173-7]: Local number of parameters: 0 (0.00MiB) [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=9|TP=1|ip-26-0-164-187]: No checkpoint path provided. 
[default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=10|TP=0|ip-26-0-164-187]: Local number of parameters: 21M (40.01MiB) [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=27|TP=0|ip-26-0-172-116]: Local number of parameters: 0 (0.00MiB) [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=27|TP=0|ip-26-0-172-116]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=1|TP=1|ip-26-0-163-236]: Local number of parameters: 21M (40.01MiB) [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=30|TP=1|ip-26-0-173-7]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=30|TP=1|ip-26-0-173-7]: No checkpoint path provided. [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=10|TP=0|ip-26-0-164-187]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=10|TP=0|ip-26-0-164-187]: No checkpoint path provided. [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=9|TP=0|ip-26-0-164-187]: Local number of parameters: 21M (40.01MiB) [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=9|TP=0|ip-26-0-164-187]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=9|TP=0|ip-26-0-164-187]: No checkpoint path provided. [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=7|TP=1|ip-26-0-164-18]: Local number of parameters: 21M (40.01MiB) [default6]:07/06/2024 09:32:34 [INFO|DP=0|PP=27|TP=0|ip-26-0-172-116]: No checkpoint path provided. [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=0|TP=1|ip-26-0-163-236]: [After model building] Memory usage: 139.18MiB. Peak allocated: 141.21MiB Peak reserved: 156.00MiB [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=29|TP=0|ip-26-0-173-7]: [After model building] Memory usage: 0.01MiB. 
Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=7|TP=1|ip-26-0-164-18]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=7|TP=1|ip-26-0-164-18]: No checkpoint path provided. [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=24|TP=1|ip-26-0-172-116]: Local number of parameters: 2.05K (0.00MiB) [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=24|TP=1|ip-26-0-172-116]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=24|TP=1|ip-26-0-172-116]: No checkpoint path provided. [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=0|TP=1|ip-26-0-163-236]: No checkpoint path provided. [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=29|TP=1|ip-26-0-173-7]: Local number of parameters: 0 (0.00MiB) [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=1|TP=1|ip-26-0-163-236]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=1|TP=1|ip-26-0-163-236]: No checkpoint path provided. [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=29|TP=1|ip-26-0-173-7]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=29|TP=0|ip-26-0-173-7]: No checkpoint path provided. [default5]:07/06/2024 09:32:34 [INFO|DP=0|PP=2|TP=1|ip-26-0-163-236]: No checkpoint path provided. [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=29|TP=1|ip-26-0-173-7]: No checkpoint path provided. [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=13|TP=0|ip-26-0-164-45]: Local number of parameters: 21M (40.01MiB) [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=13|TP=0|ip-26-0-164-45]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default2]:07/06/2024 09:32:34 [INFO|DP=0|PP=13|TP=0|ip-26-0-164-45]: No checkpoint path provided. 
[default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=12|TP=0|ip-26-0-164-45]: Local number of parameters: 21M (40.01MiB) [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=12|TP=0|ip-26-0-164-45]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default0]:07/06/2024 09:32:34 [INFO|DP=0|PP=12|TP=0|ip-26-0-164-45]: No checkpoint path provided. [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=12|TP=1|ip-26-0-164-45]: Local number of parameters: 21M (40.01MiB) [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=12|TP=1|ip-26-0-164-45]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default1]:07/06/2024 09:32:34 [INFO|DP=0|PP=12|TP=1|ip-26-0-164-45]: No checkpoint path provided. [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=15|TP=1|ip-26-0-164-45]: Local number of parameters: 21M (40.01MiB) [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=15|TP=1|ip-26-0-164-45]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default7]:07/06/2024 09:32:34 [INFO|DP=0|PP=15|TP=1|ip-26-0-164-45]: No checkpoint path provided. [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=13|TP=1|ip-26-0-164-45]: Local number of parameters: 21M (40.01MiB) [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=13|TP=1|ip-26-0-164-45]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default3]:07/06/2024 09:32:34 [INFO|DP=0|PP=13|TP=1|ip-26-0-164-45]: No checkpoint path provided. [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=14|TP=0|ip-26-0-164-45]: Local number of parameters: 21M (40.01MiB) [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=14|TP=0|ip-26-0-164-45]: [After model building] Memory usage: 41.02MiB. Peak allocated: 43.05MiB Peak reserved: 56.00MiB [default4]:07/06/2024 09:32:34 [INFO|DP=0|PP=14|TP=0|ip-26-0-164-45]: No checkpoint path provided. 
[default0]:07/06/2024 09:32:35 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: [Optimizer Building] Using LearningRateForSP as learning rate
[default0]:07/06/2024 09:32:35 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: [ZeRO sharding] Size of optimizer params per rank:
[default0]:07/06/2024 09:32:35 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: [ZeRO sharding] DP Rank 0 has 72.4M out of 72.4M (100.00%) params' optimizer states
[default2]:Traceback (most recent call last):
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module>
[default2]: trainer = DistributedTrainer(config_file)
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 185, in __init__
[default2]: self.optimizer, self.grad_accumulator = init_optimizer_and_grad_accumulator(
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/helpers.py", line 401, in init_optimizer_and_grad_accumulator
[default2]: param = model.get_parameter(optim_model_param_name)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 714, in get_parameter
[default3]:Traceback (most recent call last):
[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module>
[default3]: trainer = DistributedTrainer(config_file)
[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 185, in __init__
[default3]: self.optimizer, self.grad_accumulator = init_optimizer_and_grad_accumulator(
[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/helpers.py", line 401, in init_optimizer_and_grad_accumulator
[default3]: param = model.get_parameter(optim_model_param_name)
[default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 714, in get_parameter
[default3]: mod: torch.nn.Module = self.get_submodule(module_path)
[default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 681, in get_submodule
[default3]: raise AttributeError(mod._get_name() + " has no "
[default3]:AttributeError: PipelineBlock has no attribute `pp_block`
[default2]: mod: torch.nn.Module = self.get_submodule(module_path)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 681, in get_submodule
[default2]: raise AttributeError(mod._get_name() + " has no "
[default2]:AttributeError: PipelineBlock has no attribute `pp_block`
[default0]:07/06/2024 09:32:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples
[default0]:07/06/2024 09:32:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: Using `datasets` library
[default0]:07/06/2024 09:32:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4')
[default0]:Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:32:37 [WARNING|DP=0|PP=0|TP=0|ip-26-0-163-236]: Repo card metadata block was not found. Setting CardData to empty.
[2024-07-06 09:32:37,603] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3389334 closing signal SIGTERM
[2024-07-06 09:32:37,603] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3389335 closing signal SIGTERM
[2024-07-06 09:32:37,604] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3389336 closing signal SIGTERM
[2024-07-06 09:32:37,604] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3389338 closing signal SIGTERM
[2024-07-06 09:32:37,604] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3389339 closing signal SIGTERM
[2024-07-06 09:32:37,605] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3389340 closing signal SIGTERM
[2024-07-06 09:32:37,605] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3389341 closing signal SIGTERM
[default0]:07/06/2024 09:32:39 [WARNING|DP=0|PP=12|TP=0|ip-26-0-164-45]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:Repo card metadata block was not found. Setting CardData to empty.
[default4]:Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/06/2024 09:32:39 [WARNING|DP=0|PP=12|TP=1|ip-26-0-164-45]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/06/2024 09:32:39 [WARNING|DP=0|PP=15|TP=1|ip-26-0-164-45]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/06/2024 09:32:39 [WARNING|DP=0|PP=13|TP=1|ip-26-0-164-45]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/06/2024 09:32:39 [WARNING|DP=0|PP=14|TP=0|ip-26-0-164-45]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:Repo card metadata block was not found. Setting CardData to empty.
[default3]:Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/06/2024 09:32:39 [WARNING|DP=0|PP=15|TP=0|ip-26-0-164-45]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/06/2024 09:32:39 [WARNING|DP=0|PP=11|TP=0|ip-26-0-164-187]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/06/2024 09:32:39 [WARNING|DP=0|PP=8|TP=0|ip-26-0-164-187]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/06/2024 09:32:39 [WARNING|DP=0|PP=4|TP=0|ip-26-0-164-18]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/06/2024 09:32:39 [WARNING|DP=0|PP=6|TP=0|ip-26-0-164-18]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/06/2024 09:32:39 [WARNING|DP=0|PP=6|TP=1|ip-26-0-164-18]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/06/2024 09:32:39 [WARNING|DP=0|PP=5|TP=0|ip-26-0-164-18]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/06/2024 09:32:39 [WARNING|DP=0|PP=4|TP=1|ip-26-0-164-18]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/06/2024 09:32:39 [WARNING|DP=0|PP=10|TP=1|ip-26-0-164-187]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/06/2024 09:32:39 [WARNING|DP=0|PP=7|TP=0|ip-26-0-164-18]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. 
Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/06/2024 09:32:39 [WARNING|DP=0|PP=11|TP=1|ip-26-0-164-187]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/06/2024 09:32:39 [WARNING|DP=0|PP=9|TP=1|ip-26-0-164-187]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/06/2024 09:32:39 [WARNING|DP=0|PP=7|TP=1|ip-26-0-164-18]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/06/2024 09:32:39 [WARNING|DP=0|PP=13|TP=0|ip-26-0-164-45]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/06/2024 09:32:39 [WARNING|DP=0|PP=14|TP=1|ip-26-0-164-45]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/06/2024 09:32:39 [WARNING|DP=0|PP=5|TP=1|ip-26-0-164-18]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/06/2024 09:32:39 [WARNING|DP=0|PP=10|TP=0|ip-26-0-164-187]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/06/2024 09:32:39 [WARNING|DP=0|PP=9|TP=0|ip-26-0-164-187]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. 
[default1]:07/06/2024 09:32:39 [WARNING|DP=0|PP=8|TP=1|ip-26-0-164-187]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:Repo card metadata block was not found. Setting CardData to empty.
[2024-07-06 09:32:39,828] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 3 (pid: 3389337) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-06_09:32:37
  host : ip-26-0-172-116.ec2.internal
  rank : 51 (local_rank: 3)
  exitcode : 1 (pid: 3389337)
  error_file:
  traceback : To enable traceback see:
https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-172-116: task 6: Exited with exit code 1 [default0]:[rank32]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank32]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank32]:[E ProcessGroupNCCL.cpp:1182] [Rank 32] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. [default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-172-116.ec2.internal<41640> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa97e111d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7fa97f2b8fa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7fa97f2b927b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7fa97f2bcc1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() 
+ 0x119 (0x7fa97f2bd839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7fa9c8fc1e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7fa9ce0c9609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 (0x7fa9cde94353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [Rank 32] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. [default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-172-116.ec2.internal<41640> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa97e111d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:[rank40]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank40]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank40]:[E ProcessGroupNCCL.cpp:1182] [Rank 40] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. 
[default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-172-116.ec2.internal<57364> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f06a4acdd87 in /fsx/ferdinandmom/miniforg[default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7fa97f2b8fa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7fa97f2b927b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7fa97f2bcc1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fa97f2bd839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7fa9c8fc1e95 in /fsx/ferdinandmom/miniforge3/ene3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7f06a5c74fa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7f06a5c7527b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) vs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7fa9ce0c9609 in 
/lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 (0x7fa9cde94353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa97e111d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xdf6b11 (0x7fa97f013b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7fa9c8fc1e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7fa9ce0c9609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7f06a5c78c1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f06a5c79839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7f06ef97de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7f06f4a85609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 (0x7f06f4850353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [Rank 40] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError[default0]:frame #4: clone + 0x43 (0x7fa9cde94353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: : A call failed possibly due to a network error or a remote 
process exiting prematurely. [default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-172-116.ec2.internal<57364> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f06a4acdd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7f06a5c74fa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7f06a5c7527b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7f06a5c78c1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f06a5c79839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7f06ef97de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7f06f4a85609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 (0x7f06f4850353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f06a4acdd87 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xdf6b11 (0x7f06a59cfb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f06ef97de95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7f06f4a85609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f06f4850353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:[rank56]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank56]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank56]:[E ProcessGroupNCCL.cpp:1182] [Rank 56] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. 
[default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-172-116.ec2.internal<58950> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc209116d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7fc20a2bdfa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7fc20a2be27b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7fc20a2c1c1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fc20a2c2839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7fc253fc6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7fc2590ce609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 (0x7fc258e99353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [Rank 56] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. 
[default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-172-116.ec2.internal<58950> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc209116d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7fc20a2bdfa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7fc20a2be27b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7fc20a2c1c1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fc20a2c2839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7fc253fc6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7fc2590ce609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 (0x7fc258e99353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc209116d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: 
+ 0xdf6b11 (0x7fc20a018b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7fc253fc6e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7fc2590ce609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7fc258e99353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default3]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/06/2024 09:32:44 [WARNING|DP=0|PP=17|TP=1|ip-26-0-164-75]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/06/2024 09:32:44 [WARNING|DP=0|PP=19|TP=0|ip-26-0-164-75]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/06/2024 09:32:44 [WARNING|DP=0|PP=22|TP=0|ip-26-0-168-120]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/06/2024 09:32:44 [WARNING|DP=0|PP=21|TP=0|ip-26-0-168-120]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/06/2024 09:32:44 [WARNING|DP=0|PP=20|TP=1|ip-26-0-168-120]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/06/2024 09:32:44 [WARNING|DP=0|PP=21|TP=1|ip-26-0-168-120]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/06/2024 09:32:44 [WARNING|DP=0|PP=22|TP=1|ip-26-0-168-120]: Repo card metadata block was not found. Setting CardData to empty. 
[default7]:07/06/2024 09:32:44 [WARNING|DP=0|PP=31|TP=1|ip-26-0-173-7]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/06/2024 09:32:44 [WARNING|DP=0|PP=31|TP=0|ip-26-0-173-7]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/06/2024 09:32:44 [WARNING|DP=0|PP=23|TP=1|ip-26-0-168-120]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/06/2024 09:32:44 [WARNING|DP=0|PP=29|TP=1|ip-26-0-173-7]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/06/2024 09:32:44 [WARNING|DP=0|PP=29|TP=0|ip-26-0-173-7]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/06/2024 09:32:44 [WARNING|DP=0|PP=23|TP=0|ip-26-0-168-120]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/06/2024 09:32:44 [WARNING|DP=0|PP=30|TP=0|ip-26-0-173-7]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. 
[2024-07-06 09:32:47,612] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 79377 closing signal SIGTERM [2024-07-06 09:32:47,612] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 79378 closing signal SIGTERM [2024-07-06 09:32:47,613] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 79379 closing signal SIGTERM [2024-07-06 09:32:47,613] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 79380 closing signal SIGTERM [2024-07-06 09:32:47,614] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 79381 closing signal SIGTERM [2024-07-06 09:32:47,614] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 965624 closing signal SIGTERM [2024-07-06 09:32:47,614] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 965625 closing signal SIGTERM [2024-07-06 09:32:47,614] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 79382 closing signal SIGTERM [2024-07-06 09:32:47,615] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 79383 closing signal SIGTERM [2024-07-06 09:32:47,615] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 965626 closing signal SIGTERM [2024-07-06 09:32:47,615] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 965627 closing signal SIGTERM [2024-07-06 09:32:47,616] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 965628 closing signal SIGTERM [2024-07-06 09:32:47,617] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2777169 closing signal SIGTERM [2024-07-06 09:32:47,616] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 965629 closing signal SIGTERM [2024-07-06 09:32:47,617] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2777170 closing signal SIGTERM [2024-07-06 09:32:47,616] 
torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 965630 closing signal SIGTERM [2024-07-06 09:32:47,617] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2777171 closing signal SIGTERM [2024-07-06 09:32:47,618] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2777172 closing signal SIGTERM [2024-07-06 09:32:47,618] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2777173 closing signal SIGTERM [2024-07-06 09:32:47,619] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2777174 closing signal SIGTERM [2024-07-06 09:32:47,619] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2777175 closing signal SIGTERM [default0]:Using the latest cached version of the dataset since roneneldan/TinyStories couldn't be found on the Hugging Face Hub [default0]:Found the latest cached dataset configuration 'default' at /admin/home/ferdinand_mom/.cache/roneneldan___tiny_stories/default/0.0.0/691b0d9bd48ade766778c940011ca1c549f6359b (last modified on Mon Jun 24 07:59:52 2024). [default0]:07/06/2024 09:32:47 [WARNING|DP=0|PP=0|TP=0|ip-26-0-163-236]: Using the latest cached version of the dataset since roneneldan/TinyStories couldn't be found on the Hugging Face Hub [default0]:07/06/2024 09:32:47 [WARNING|DP=0|PP=0|TP=0|ip-26-0-163-236]: Found the latest cached dataset configuration 'default' at /admin/home/ferdinand_mom/.cache/roneneldan___tiny_stories/default/0.0.0/691b0d9bd48ade766778c940011ca1c549f6359b (last modified on Mon Jun 24 07:59:52 2024). 
[2024-07-06 09:32:49,736] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 2777168) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:32:47 host : ip-26-0-173-7.ec2.internal rank : 56 (local_rank: 0) exitcode : -6 (pid: 2777168) error_file: traceback : Signal 6 (SIGABRT) received by PID 2777168 ============================================================ [2024-07-06 09:32:49,836] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 79376) of binary: 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:32:47 host : ip-26-0-168-120.ec2.internal rank : 40 (local_rank: 0) exitcode : -6 (pid: 79376) error_file: traceback : Signal 6 (SIGABRT) received by PID 79376 ============================================================ [2024-07-06 09:32:50,035] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 965623) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:32:47 host : ip-26-0-164-75.ec2.internal rank : 32 (local_rank: 0) exitcode : -6 (pid: 965623) error_file: traceback : Signal 6 (SIGABRT) received by PID 965623 ============================================================ srun: error: ip-26-0-168-120: task 5: Exited with exit code 1 srun: error: ip-26-0-173-7: task 7: Exited with exit code 1 srun: error: ip-26-0-164-75: task 3: Exited with exit code 1 [default0]:07/06/2024 09:32:51 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: [Training Plan] There are 1 training stages [default0]:07/06/2024 09:32:51 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: [Stage Training Stage] start from step 1 
[default0]:07/06/2024 09:32:51 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: [default0]:07/06/2024 09:32:51 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: [Start training] datetime: 2024-07-06 09:32:51.187093 | mbs: 8 | grad_accum: 128 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0 [default4]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/06/2024 09:32:51 [WARNING|DP=0|PP=2|TP=0|ip-26-0-163-236]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/06/2024 09:32:51 [WARNING|DP=0|PP=3|TP=1|ip-26-0-163-236]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/06/2024 09:32:51 [WARNING|DP=0|PP=0|TP=1|ip-26-0-163-236]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/06/2024 09:32:51 [WARNING|DP=0|PP=1|TP=0|ip-26-0-163-236]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/06/2024 09:32:51 [WARNING|DP=0|PP=2|TP=1|ip-26-0-163-236]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/06/2024 09:32:51 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps [default0]:07/06/2024 09:32:51 [INFO|DP=0|PP=0|TP=0|ip-26-0-163-236]: Memory usage: 691.85MiB. Peak allocated 691.85MiB. Peak reserved: 712.00MiB [default0]:[rank16]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[default0]:[rank16]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank16]:[E ProcessGroupNCCL.cpp:1182] [Rank 16] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. [default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-164-75.ec2.internal<50212> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0e5412dd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7f0e552d4fa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7f0e552d527b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7f0e552d8c1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f0e552d9839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7f0e9efdde95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7f0ea40e5609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 
(0x7f0ea3eb0353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [Rank 16] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. [default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-164-75.ec2.internal<50212> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0e5412dd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7f0e552d4fa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7f0e552d527b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7f0e552d8c1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f0e552d9839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7f0e9efdde95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7f0ea40e5609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 
(0x7f0ea3eb0353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0e5412dd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0xdf6b11 (0x7f0e5502fb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f0e9efdde95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7f0ea40e5609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f0ea3eb0353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default6]:07/06/2024 09:32:56 [WARNING|DP=0|PP=3|TP=0|ip-26-0-163-236]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/06/2024 09:32:56 [WARNING|DP=0|PP=1|TP=1|ip-26-0-163-236]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. 
[2024-07-06 09:32:57,615] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 421122 closing signal SIGTERM [2024-07-06 09:32:57,615] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 421123 closing signal SIGTERM [2024-07-06 09:32:57,616] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 421124 closing signal SIGTERM [2024-07-06 09:32:57,616] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 421125 closing signal SIGTERM [2024-07-06 09:32:57,617] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 421126 closing signal SIGTERM [2024-07-06 09:32:57,617] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 421127 closing signal SIGTERM [2024-07-06 09:32:57,617] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 421128 closing signal SIGTERM [2024-07-06 09:32:59,952] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 421121) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:32:57 host : ip-26-0-164-187.ec2.internal rank : 16 (local_rank: 0) exitcode : -6 (pid: 421121) error_file: traceback : Signal 6 (SIGABRT) received by PID 421121 ============================================================ srun: error: ip-26-0-164-187: task 4: Exited with exit code 1 [default0]:[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default0]:[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down. [default0]:[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. 
[default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-164-75.ec2.internal<39344> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1c7c991d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7f1c7db38fa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7f1c7db3927b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7f1c7db3cc1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f1c7db3d839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7f1cc7841e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7f1ccc949609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 (0x7f1ccc714353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. 
[default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-164-75.ec2.internal<39344> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1c7c991d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7f1c7db38fa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7f1c7db3927b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7f1c7db3cc1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f1c7db3d839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7f1cc7841e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7f1ccc949609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 (0x7f1ccc714353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:terminate called after throwing an instance of 'c10::DistBackendError' [default0]: what(): [Rank 0] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default0]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. 
[default0]:Last error: [default0]:socketProgress: Connection closed by remote peer ip-26-0-164-75.ec2.internal<39344> [default0]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1c7c991d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7f1c7db38fa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7f1c7db3927b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7f1c7db3cc1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f1c7db3d839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #5: + 0xd3e95 (0x7f1cc7841e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #6: + 0x8609 (0x7f1ccc949609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #7: clone + 0x43 (0x7f1ccc714353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1c7c991d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 
0xdf6b11 (0x7f1c7d893b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #2: + 0xd3e95 (0x7f1cc7841e95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default0]:frame #3: + 0x8609 (0x7f1ccc949609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default0]:frame #4: clone + 0x43 (0x7f1ccc714353 in /lib/x86_64-linux-gnu/libc.so.6) [default0]: [default0]:Traceback (most recent call last): [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]: trainer.train(dataloader) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default0]: outputs = self.pipeline_engine.train_batch_iter( [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]: output = model(**micro_batch) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in 
forward [default0]: sharded_logits = self.model( [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]: pipeline_state.run_communication() [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]: recv_activation_tensor = recv_activation() [default0]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]: dist.recv( [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]: return func(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default0]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:torch.distributed.DistBackendError: [12] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '11:12', but store->get('11:12') got error: Connection reset by peer [default0]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd1d86b9d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0x589518e (0x7fd21067318e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #2: 
c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fd21066d9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fd21066dce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fd21066eb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd210623f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd210623f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd210623f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd210623f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fd1d9861c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fd1d9868c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fd1d988bb60 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #12: + 0x5838439 (0x7fd210616439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #13: + 0x5843330 (0x7fd210621330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #14: + 0x58433c5 (0x7fd2106213c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #15: + 0x4e893cc (0x7fd20fc673cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #16: + 0x1a08a88 (0x7fd20c7e6a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #17: + 0x5849a84 (0x7fd210627a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #18: + 0x584ed35 (0x7fd21062cd35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #19: + 0xc97eee (0x7fd222edeeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:frame #20: + 0x413ea4 (0x7fd22265aea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:frame #21: + 0x1445a6 (0x55656df275a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55656df20a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #23: + 0x150866 (0x55656df33866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 
(0x55656df1c142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55656df27a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #26: PyObject_Call + 0xbc (0x55656df33f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55656df1a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55656df27a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55656df188fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #30: + 0x150582 (0x55656df33582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55656df188fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #32: + 0x150582 (0x55656df33582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55656df188fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #34: + 0x150582 (0x55656df33582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55656df188fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55656df1ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55656df31c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #38: + 0x211239 (0x55656dff4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #39: 
_PyObject_MakeTpCall + 0x26b (0x55656df20a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55656df1c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55656df27a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:Traceback (most recent call last): [default0]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55656df17c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55656df27a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:Traceback (most recent call last): [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]: trainer.train(dataloader) [default0]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55656df188fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #45: + 0x150582 (0x55656df33582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #46: PyObject_Call + 0xbc (0x55656df33f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55656df1a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #48: + 0x150582 (0x55656df33582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #49: PyObject_Call + 0xbc (0x55656df33f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55656df1a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #51: _PyFunction_Vectorcall + 0x6c 
(0x55[default4]: trainer.train(dataloader) 656df27a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55656df20007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55656df31c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #54: + 0x211239 (0x55656dff4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #55: PyObject_Call + 0x207 (0x55656df34067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55656df1a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #57: + 0x150582 (0x55656df33582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55656df188fa in /fsx/ferdinandmom/miniforge3/e[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train nvs/env-bench-cluster/bin/python3.10) [default0]:frame #59: + 0x150582 (0x55656df33582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #60: PyObject_Call + 0xbc (0x55656df33f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55656df1a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]:Traceback (most recent call last): [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]: trainer.train(dataloader) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train 
[default2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default0]:frame #62: + 0x150582 (0x55656df33582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #63: PyObject_Call + 0xbc (0x55656df33f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default2]: outputs = self.pipeline_engine.train_batch_iter( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:Traceback (most recent call last): [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]: trainer.train(dataloader) [default4]: outputs = self.pipeline_engine.train_batch_iter( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]: output = model(**micro_batch) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return forward_call(*args, **kwargs) [default2]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default2]: sharded_logits = self.model( [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/f[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]: outputs = self.pipeline_engine.train_batch_iter( [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]: output = model(**micro_batch) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter erdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return fo return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default4]: sharded_logits = self.model( [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:Traceback (most recent call last): [default4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) rward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", 
line 780, in forward_with_hidden_states [default4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]: pipeline_state.run_communication() [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]: recv_activation_tensor = recv_activation() [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_te [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]: pipeline_state.run_communication() [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]: recv_activation_tensor = recv_activation() [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/par[default4]: output = model(**micro_batch) nsors [default2]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]: dist.recv( [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default2]: return func(*args, **kwargs) allel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]: 
dist.recv( [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default4]: return func(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default4]: pg.recv([tensor], group_src_rank, tag).wait() [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:torch.distributed.DistBackendError: [5] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '4:5', but store->get('4:5') got error: Connection reset by peer [default2]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f736673fd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:torch.distributed.DistBackendError: [14] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '13:14', but store->get('13:14') got error: Connection reset by peer [default4]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f78f5ca2d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default2]:frame #1: + 0x589518e (0x7f739e6f918e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #1: + 0x589518e (0x7f792dc5c18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f792dc569a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f792dc56ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: outputs = self.pipeline_engine.train_batch_iter( [default2]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f739e6f39a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f739e6f3ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f792dc57b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f792dc0cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f792dc0cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: return self._call_impl(*args, **kwargs) [default7]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f739e6f4b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f739e6a9f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f739e6a9f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f739e6a9f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f739e6a9f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f792dc0cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f792dc0cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f73678e7c69 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f78f6e4ac69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f78f6e51c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f73678eec5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f7367911b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f78f6e74b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #12: + 0x5838439 (0x7f792dbff439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #13: + 0x5843330 (0x7f792dc0a330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: trainer.train(dataloader) [default2]:frame #12: + 0x5838439 (0x7f739e69c439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #13: + 
0x5843330 (0x7f739e6a7330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #14: + 0x58433c5 (0x7f739e6a73c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #14: + 0x58433c5 (0x7f792dc0a3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #15: + 0x4e893cc (0x7f792d2503cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #16: + 0x1a08a88 (0x7f7929dcfa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]:frame #15: + 0x4e893cc (0x7f739dced3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #16: + 0x1a08a88 (0x7f739a86ca88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #17: + 0x5849a84 (0x7f739e6ada84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #17: + 0x5849a84 (0x7f792dc10a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default2]:frame #18: + 0x584ed35 (0x7f739e6b2d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #19: + 0xc97eee (0x7f73b0f64eee in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #20: + 0x413ea4 (0x7f73b06e0ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #21: + 0x1445a6 (0x55df981055a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #18: + 0x584ed35 (0x7f792dc15d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #19: + 0xc97eee (0x7f79404c7eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #20: + 0x413ea4 (0x7f793fc43ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]: output = model(**micro_batch) [default2]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55df980fea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #23: + 0x150866 (0x55df98111866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55df980fa142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #21: + 0x1445a6 (0x55e975ace5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55e975ac7a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55df98105a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #26: PyObject_Call + 0xbc (0x55df98111f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 
(0x55df980f82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55df98105a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55df980f68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #23: + 0x150866 (0x55e975ada866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55e975ac3142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55e975acea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: sharded_logits = self.model( [default2]:frame #30: + 0x150582 (0x55df98111582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55df980f68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #32: + 0x150582 (0x55df98111582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55df980f68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #26: PyObject_Call + 0xbc (0x55e975adaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55e975ac12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55e975acea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55e975abf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #30: + 0x150582 (0x55e975ada582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default2]:frame #34: + 0x150582 (0x55df98111582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55df980f68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55df980fdf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55e975abf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #32: + 0x150582 (0x55e975ada582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55e975abf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #34: + 0x150582 (0x55e975ada582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55df9810fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #38: + 0x211239 (0x55df981d2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55df980fea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55e975abf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55e975ac6f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #37: 
_PyObject_Call_Prepend + 0x69 (0x55e975ad8c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55df980fa3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55df98105a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55df980f5c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55df98105a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55df980f68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #45: + 0x150582 (0x55df98111582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #38: + 0x211239 (0x55e975b9b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55e975ac7a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55e975ac33e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self._call_impl(*args, **kwargs) [default2]:frame #46: PyObject_Call + 0xbc (0x55df98111f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55df980f82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #48: + 0x150582 (0x55df98111582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #41: _PyFunction_Vectorcall + 
0x6c (0x55e975acea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55e975abec5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55e975acea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55e975abf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: return self._call_impl(*args, **kwargs) [default2]:frame #49: PyObject_Call + 0xbc (0x55df98111f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55df980f82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55df98105a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #45: + 0x150582 (0x55e975ada582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #46: PyObject_Call + 0xbc (0x55e975adaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: outputs = self.pipeline_engine.train_batch_iter( [default2]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55df980fe007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55df9810fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #54: + 0x211239 (0x55df981d2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #55: PyObject_Call + 0x207 (0x55df98112067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame 
#56: _PyEval_EvalFrameDefault + 0x2d83 (0x55df980f82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55e975ac12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #48: + 0x150582 (0x55e975ada582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #49: PyObject_Call + 0xbc (0x55e975adaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]:frame #57: + 0x150582 (0x55df98111582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55df980f68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #59: + 0x150582 (0x55df98111582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55e975ac12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55e975acea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55e975ac7007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55e975ad8c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default2]:frame #60: PyObject_Call + 0xbc (0x55df98111f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #61: 
_PyEval_EvalFrameDefault + 0x2d83 (0x55df980f82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #62: + 0x150582 (0x55df98111582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #63: PyObject_Call + 0xbc (0x55df98111f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #54: + 0x211239 (0x55e975b9b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #55: PyObject_Call + 0x207 (0x55e975adb067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55e975ac12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return forward_call(*args, **kwargs) [default2]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:frame #57: + 0x150582 (0x55e975ada582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55e975abf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #59: + 0x150582 (0x55e975ada582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:Traceback (most recent call last): [default4]:frame #60: PyObject_Call + 0xbc (0x55e975adaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55e975ac12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #62: + 0x150582 (0x55e975ada582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #63: PyObject_Call + 0xbc (0x55e975adaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:. 
This may indicate a possible application crash on rank 0 or a network set up issue. [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:Traceback (most recent call last): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:Traceback (most recent call last): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]: trainer.train(dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]: outputs = self.pipeline_engine.train_batch_iter( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]: output = [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:Traceback (most recent call last): model(**micro_batch) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) 
[default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default5]: output = model(**micro_batch) [default1]:Traceback (most recent call last): [default4]: trainer.train(dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default6]: sharded_logits = self.model( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]: hidden_encoder_states = encoder_block(**hidden_[default4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in encoder_states) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]: pipeline_state.run_communication() [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communic[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: sharded_logits = self.model( [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in ation [default6]: recv_activation_tensor = recv_activation() [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]: return self._call_impl(*args, **kwargs) [default1]: trainer.train(dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]: dist.recv( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default6]: return func(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-p[default4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step ackages/torch/distributed/distributed_c10d.py", line 1706, in recv [default6]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:torch.distributed.DistBackendError: [15] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '14:15', but store->get('14:15') got error: Connection reset by peer [default6]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd791327d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0x589518e (0x7fd7c92e118e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fd7c92db9a0 in /fsx/ferdinandmom/minifor[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl 
[default6]: trainer.train(dataloader) ge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fd7c92dbce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fd7c92dcb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd7c9291f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd7c9291f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd7c9291f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3[default4]: return self._call_impl(*args, **kwargs) [default0]: trainer.train(dataloader) .10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd7c9291f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fd7924cfc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fd7924d6c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 
(0x7fd7924f9b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/l[default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return forward_call(*args, **kwargs) [default4]: outputs = self.pipeline_engine.train_batch_iter( ib/libtorch_cuda.so) [default6]:frame #12: + 0x5838439 (0x7fd7c9284439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #13: + 0x5843330 (0x7fd7c928f330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #14: + 0x58433c5 (0x7fd7c928f3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #15: + 0x4e893cc (0x7fd7c88d53cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #16: + 0x1a08a88 (0x7fd7c5454a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #17: + 0x5849a84 (0x7fd7c9295a84 in /fsx/ferdinandmom/miniforge3/en[default7]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train vs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default6]:frame #18: + 0x584ed35 (0x7fd7c929ad35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:frame #19: + 0xc97eee (0x7fd7dbb4ceee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #20: + 0x413ea4 (0x7fd7db2c8ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:frame #21: + 0x1445a6 (0x564eab1385a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:frame #22: _PyObject_MakeTpCall + 0x26b (0x564eab131a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #23: + 0x150866 (0x564eab144866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x564eab12d142 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]:frame #25: _PyFunction_Vectorcall + 0x6c (0x564eab138a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #26: PyObject_Call + 0xbc (0x564eab144f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: pipeline_state.run_communication() [default5]: sharded_logits = self.model( [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x564eab12b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #28: _PyFunction_Vectorcall + 0x6c (0x564eab138a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x564eab1298fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]:frame #30: + 0x150582 (0x564eab144582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x564eab1298fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #32: + 0x150582 (0x564eab144582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 
1511, in _wrapped_call_impl [default1]: outputs = self.pipeline_engine.train_batch_iter( [default6]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x564eab1298fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #34: + 0x150582 (0x564eab144582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x564eab1298fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x564eab130f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]: recv_activation_tensor = recv_activation() [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]:frame #37: _PyObject_Call_Prepend + 0x69 (0x564eab142c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #38: + 0x211239 (0x564eab205239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default4]: output = model(**micro_batch) [default6]:frame #39: _PyObject_MakeTpCall + 0x26b (0x564eab131a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #40: _PyEval_EvalFrameDefault 
+ 0x4eb6 (0x564eab12d3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #41: _PyFunction_Vectorcall + 0x6c (0x564eab138a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x564eab128c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #43: _PyFunction_Vectorcall + 0x6c (0x564eab138a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x564eab1298fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #45: + 0x150582 (0x564eab144582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #46: PyObject_Call + 0xb[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]: outputs = self.pipeline_engine.train_batch_iter( c (0x564eab144f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x564eab12b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #48: + 0x150582 (0x564eab144582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #49: PyObject_Call + 0xbc (0x564eab144f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return self._call_impl(*args, **kwargs) [default4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]: outputs = self.pipeline_engine.train_batch_iter( [default6]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x564eab12b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #51: _PyFunction_Vectorcall + 0x6c (0x564eab138a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: 
File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x564eab131007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #53: _PyObject_Call_Prepend + 0x69 (0x564eab142c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #54: + 0x211239 (0x564eab205239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #55: PyObject_Call + 0x207 (0x564eab145067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default6]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x564eab12b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #57: + 0x150582 (0x564eab144582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default4]: return self._call_impl(*args, **kwargs) [default6]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x564eab1298fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #59: + 0x150582 (0x564eab144582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #60: PyObject_Call + 0xbc (0x564eab144f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self._call_impl(*args, **kwargs) [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x564eab12b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #62: + 0x150582 (0x564eab144582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:frame #63: PyObject_Call + 0xbc (0x564eab144f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]:Traceback (most recent call last): [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]: trainer.train(dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default7]: outputs = self.pipeline_engine.train_batch_iter( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]: output = [default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward model(**micro_batch) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]: sharded_logits = self.model( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]: pipeline_state.run_communication() [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]: recv_activation_tensor = recv_activation() [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_clust[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: output = model(**micro_batch) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward er/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]: dist.recv( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, 
in wrapper [default7]: return func(*args, **kwargs) [default[default7]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward 7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default7]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:torch.distributed.DistBackendError: [15] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '14:15', but store->get('14:15') got error: Connection reset by peer [default7]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9403d8ed87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]:frame #1: + 0x589518e (0x7f943bd4818e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f943bd429a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f943bd42ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f943bd43b11 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f943bcf8f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]: output = model(**micro_batch) .so) [default7]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f943bcf8f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f943bcf8f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f943bcf8f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f9404f36c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]: sharded_logits = self.model( 7f9404f3dc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f9404f60b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) 
[default7]:frame #12: + 0x5838439 (0x7f943bceb439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #13: + 0x5843330 (0x7f943bcf6330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #14: + 0x58433c5 (0x7f943bcf63c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #15: + 0x4e893cc (0x7f943b33c3cc in /fsx/ferdinandmom/miniforge3/envs/e[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) nv-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #16: + 0x1a08a88 (0x7f9437ebba88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #17: + 0x5849a84 (0x7f943bcfca84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]:frame #18: + 0x584ed35 (0x7f943bd01d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #19: + 0xc97eee (0x7f944e5b3eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #20: + 0x413ea4 (0x7f944dd2fea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #21: + 0x1445a6 (0x55624c9d55a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default7]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55624c9cea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #23: + 0x150866 (0x55624c9e1866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55624[default7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl c9ca142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: dist.recv( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return self._call_impl(*args, **kwargs) [default7]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55624c9d5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #26: PyObject_Call + 0xbc (0x55624c9e1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55624c9c82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55624c9d5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default4]: return self._call_impl(*args, **kwargs) [default7]:frame #29: _PyEval_EvalFrameDefault + 0x13ca 
(0x55624c9c68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #30: + 0x150582 (0x55624c9e1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55624c9c68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #32: + 0x150582 (0x55624c9e1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55624c9c68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: pipeline_state.run_communication() [default1]: output = model(**micro_batch) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default7]:frame #34: + 0x150582 (0x55624c9e1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55624c9c68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55624c9cdf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: return func(*args, **kwargs) [default5]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55624c9dfc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #38: + 0x211239 (0x55624caa2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55624c9cea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default0]: return forward_call(*args, **kwargs) [default7]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55624c9ca3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55624c9d5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55624c9c5c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55624c9d5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: return self._call_impl(*args, **kwargs) [default7]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55624c9c68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #45: + 0x150582 (0x55624c9e1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #46: PyObject_Call + 0xbc (0x55624c9e1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55624c9c82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #48: + 0x150582 (0x55624c9e1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #49: PyObject_Call + 0xbc (0x55624c9e1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55624c9c82b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward 624c9d5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55624c9ce007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: pg.recv([tensor], group_src_rank, tag).wait() [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55624c9dfc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #54: + 0x211239 (0x55624caa2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #55: PyObject_Call + 0x207 (0x55624c9e2067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55624c9c82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #57: + 0x150582 (0x55624c9e1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55624c9c68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]: 
pipeline_state.run_communication() [default0]: sharded_logits = self.model( [default7]:frame #59: + 0x150582 (0x55624c9e1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #60: PyObject_Call + 0xbc (0x55624c9e1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55624c9c82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #62: + 0x150582 (0x55624c9e1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default6]: sharded_logits = self.model( [default7]:frame #63: PyObject_Call + 0xbc (0x55624c9e1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default7]: recv_activation_tensor = recv_activation() [default4]: return forward_call(*args, **kwargs) [default3]:Traceback (most recent call last): [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]: trainer.train(dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default3]: outputs = self.pipeline_engine.train_batch_iter( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]: output = [default4]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward model(**micro_batch) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]: sharded_logits = self.model( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]: return self._call_impl(*args, **kwargs) [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-package[default5]: recv_activation_tensor = recv_activation() [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl s/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: 
return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]: pipeline_state.run_communication() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]: recv_activation_tensor = recv_activation() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]: File "/fsx/ferdinandm[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return self._call_impl(*args, **kwargs) om/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]: dist.recv( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default3]: return func(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default3]: pg.recv([tensor], group_src_rank, tag).w[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]: return forward_call(*args, **kwargs) ait() [default5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]:torch.distributed.DistBackendError: [13] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '12:13', but store->get('12:13') got error: Connection reset by peer [default3]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc14fa5ad87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0x589518e (0x7fc187a1418e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fc187a0e9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #3: c10d::TCPStore::doGet(st[default7]: buffers, futures = 
self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward d::string const&) + 0x32 (0x7fc187a0ece2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fc187a0fb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc1879c4f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc1879c4f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb0f1862d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc1879c4f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #8: c10d::PrefixStore::get(std::string const&) + 
0x31 (0x7fc1879c4f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fc150c02c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fc150c09c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fc150c2cb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #12: + 0x5838439 (0x7fc1879b7439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #13: + 0x5843330 (0x7fc1879c2330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #1: + 0x589518e (0x7fb12981c18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:frame #14: + 0x58433c5 (0x7fc1879c23c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #15: + 0x4e893cc (0x7fc1870083cc in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #16: + 0x1a08a88 (0x7fc183b87a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #17: + 0x5849a84 (0x7fc1879c8a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:frame #18: + 0x584ed35 (0x7fc1879cdd35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #19: + 0xc97eee (0x7fc19a27feee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #20: + 0x413ea4 (0x7fc1999fbea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #21: + 0x1445a6 (0x561ac12535a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #22: _PyObject_MakeTpCall + 0x26b (0x561ac124ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #23: + 0x150866 (0x561ac125f866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x561ac[default4]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fb1298169a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states 1248142 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #25: _PyFunction_Vectorcall + 0x6c (0x561ac1253a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #26: PyObject_Call + 0xbc (0x561ac125ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x561ac12462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #28: _PyFunction_Vectorcall + 0x6c (0x561ac1253a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x561ac12448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #30: + 0x150582 (0x561ac125f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x561ac12448fa in /fsx/ferdinandmom/miniforge3/envs/env[default4]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fb129816ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) -bench-cluster/bin/python3.10) [default3]:frame #32: + 0x150582 (0x561ac125f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x561ac12448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #34: + 0x150582 (0x561ac125f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #35: 
_PyEval_EvalFrameDefault + 0x13ca (0x561ac12448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x561ac124bf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: dist.recv( [default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:frame #37: _PyObject_Call_Prepend + 0x69 (0x561ac125dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #38: + 0x211239 (0x561ac1320239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #39: _PyObject_MakeTpCall + 0x26b (0x561ac124ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x561ac12483e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #41: _PyFunction_Vectorcall + 0x6c (0x561ac1253a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x561ac1243c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fb129817b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]:frame #43: _PyFunction_Vectorcall + 0x6c (0x561ac1253a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x561ac12448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #45: + 0x150582 (0x561ac125f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #46: PyObject_Call + 0xbc 
(0x561ac125ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]: sharded_logits = self.model( [default3]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x561ac12462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #48: + 0x150582 (0x561ac125f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #49: PyObject_Call + 0xbc (0x561ac125ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x561ac12462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #51: _PyFunction_Vectorcall + 0x6c (0x561ac1253a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb1297ccf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: return self._call_impl(*args, **kwargs) [default3]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x561ac124c007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #53: _PyObject_Call_Prepend + 0x69 (0x561ac125dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #54: + 0x211239 (0x561ac1320239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #55: PyObject_Call + 0x207 (0x561ac1260067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x561ac12462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #57: + 0x150582 (0x561ac125f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x561ac12448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #59: + 0x150582 (0x561ac125f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #60: PyObject_Call + 0xbc (0x561ac125ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb1297ccf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: return func(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x561ac12462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #62: + 0x150582 (0x561ac125f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #63: PyObject_Call + 0xbc (0x561ac125ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default4]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb1297ccf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: return forward_call(*args, **kwargs) [default0]: return self._call_impl(*args, **kwargs) [default5]:Traceback (most recent call last): [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]: trainer.train(dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default5]: outputs = self.pipeline_engine.train_batch_iter( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default7]: pg.recv([tensor], group_src_rank, tag).wait() [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]: output = model(**micro_batch) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default5]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default5]: sharded_logits = self.model( [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fb1297ccf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return self._call_impl(*args, **kwargs) [default1]: return self._call_impl(*args, **kwargs) [default5]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fb0f2a0ac69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]: pipeline_state.run_communication() [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]: dist.recv( [default4]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]: return forward_call(*args, **kwargs) [default5]: recv_activation_tensor = recv_activation() [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fb0f2a11c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]: return forward_call(*args, **kwargs) [default5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc7e3538d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]: dist.recv( [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default6]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]: return func(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fb0f2a34b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:torch.distributed.DistBackendError: [14] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '13:14', but store->get('13:14') got error: Connection reset by peer [default5]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa9a07d2d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]: return func(*args, **kwargs) [default7]:frame #1: + 0x589518e (0x7fc81b4f218e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:frame #1: + 0x589518e (0x7fa9d878c18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 
(0x7fa9d87869a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fa9d8786ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fa9d8787b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa9d873cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #12: + 0x5838439 (0x7fb1297bf439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa9d873cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa9d873cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa9d873cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 
(0x7fc81b4ec9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]: pipeline_state.run_communication() [default5]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fa9a197ac69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fa9a1981c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fa9a19a4b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]: pg.recv([tensor], group_src_rank, tag).wait() [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:frame #12: + 0x5838439 (0x7fa9d872f439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #13: + 0x5843330 (0x7fa9d873a330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #14: + 0x58433c5 (0x7fa9d873a3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #15: + 0x4e893cc (0x7fa9d7d803cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default4]:frame #13: + 0x5843330 (0x7fb1297ca330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: pipeline_state.run_communication() [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:frame #16: + 0x1a08a88 (0x7fa9d48ffa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #17: + 0x5849a84 (0x7fa9d8740a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fc81b4ecce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:frame #18: + 0x584ed35 (0x7fa9d8745d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #19: + 0xc97eee (0x7fa9eaff7eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #20: + 0x413ea4 (0x7fa9ea773ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #21: + 0x1445a6 (0x5638fbb2b5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5638fbb24a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #14: + 0x58433c5 (0x7fb1297ca3c5 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:frame #23: + 0x150866 (0x5638fbb37866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5638fbb20142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5638fbb2ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #15: + 0x4e893cc (0x7fb128e103cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: recv_activation_tensor = recv_activation() [default5]:frame #26: PyObject_Call + 0xbc (0x5638fbb37f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5638fbb1e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5638fbb2ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5638fbb1c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default5]:frame #0: 
c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9e8814dd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]: pipeline_state.run_communication() [default5]:frame #30: + 0x150582 (0x5638fbb37582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5638fbb1c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #32: + 0x150582 (0x5638fbb37582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5638fbb1c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #16: + 0x1a08a88 (0x7fb12598fa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:frame #34: + 0x150582 (0x5638fbb37582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5638fbb1c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5638fbb23f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5638fbb35c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fc81b4edb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #1: + 0x589518e (0x7f9ec010718e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: recv_activation_tensor = 
recv_activation() [default5]:frame #38: + 0x211239 (0x5638fbbf8239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5638fbb24a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5638fbb203e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5638fbb2ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc81b4a2f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5638fbb1bc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5638fbb2ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5638fbb1c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f9ec01019a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:frame #45: + 0x150582 (0x5638fbb37582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #46: PyObject_Call + 0xbc (0x5638fbb37f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5638fbb1e2b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #48: + 0x150582 (0x5638fbb37582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc81b4a2f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]:frame #49: PyObject_Call + 0xbc (0x5638fbb37f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5638fbb1e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5638fbb2ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5638fbb24007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #17: + 0x5849a84 (0x7fb1297d0a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f9ec0101ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5638fbb35c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #54: + 0x211239 (0x5638fbbf8239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default5]:frame #55: PyObject_Call + 0x207 (0x5638fbb38067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc81b4a2f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5638fbb1e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #57: + 0x150582 (0x5638fbb37582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5638fbb1c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #59: + 0x150582 (0x5638fbb37582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc81b4a2f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:frame #60: PyObject_Call + 0xbc (0x5638fbb37f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5638fbb1e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #62: + 0x150582 (0x5638fbb37582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #63: PyObject_Call + 0xbc (0x5638fbb37f1c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fc7e46e0c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fc7e46e7c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fc7e470ab60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:Traceback (most recent call last): [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: output = self.forward(context=context, state=state, 
micro_batch=micro_batch, model=model) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: output = [default7]:frame #12: + 0x5838439 (0x7fc81b495439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication model(**micro_batch) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default7]:frame #13: + 0x5843330 (0x7fc81b4a0330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]: sharded_logits = self.model( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]: return 
self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/[default7]:frame #14: + 0x58433c5 (0x7fc81b4a03c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #15: + 0x4e893cc (0x7fc81aae63cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: recv_activation_tensor = recv_activation() modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]: pipeline_state.run_communication() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]: recv_activation_tensor = recv_activation() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_clust[default7]:frame #16: + 0x1a08a88 (0x7fc817665a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) er/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:frame #17: + 0x5849a84 (0x7fc81b4a6a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]: dist.recv( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default1]: return func(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default1]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:torch.distributed.DistBackendError: [12] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '11:12', but store->get('11:12') got 
error: Connection reset by peer [default1]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default5]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f9ec0102b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f282cc62d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0x589518e (0x7f2864c1c18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f2864c169a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9ec00b7f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: pipeline_state.run_communication() [default1]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f2864c16ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f2864c17b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2864bccf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9ec00b7f81 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2864bccf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2864bccf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2864bccf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9ec00b7f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: dist.recv( [default1]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f282de0ac69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f282de11c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f282de34b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #12: + 0x5838439 (0x7f2864bbf439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #13: +[default5]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 
(0x7f9ec00b7f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f9e892f5c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors 0x5843330 (0x7f2864bca330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f9e892fcc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default1]:frame #14: + 0x58433c5 (0x7f2864bca3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #15: + 0x4e893cc (0x7f28642103cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #16: + 0x1a08a88 (0x7f2860d8fa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #17: + 0x5849a84 (0x7f2864bd0a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #18: + 0x584ed35 (0x7fc81b4abd35 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:frame #18: + 0x584ed35 (0x7f2864bd5d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f9e8931fb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]: dist.recv( [default6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:frame #19: + 0xc97eee (0x7f2877487eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #20: + 0x413ea4 (0x7f2876c03ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #21: + 0x1445a6 (0x56001e0b75a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #22: _PyObject_MakeTpCall + 0x26b (0x56001e0b0a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #18: + 0x584ed35 (0x7fb1297d5d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #19: + 0xc97eee (0x7fc82dd5deee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]: return func(*args, **kwargs) [default1]:frame #23: + 0x150866 (0x56001e0c3866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56001e0ac142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #25: 
_PyFunction_Vectorcall + 0x6c (0x56001e0b7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #26: PyObject_Call + 0xbc (0x56001e0c3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #19: + 0xc97eee (0x7fb13c087eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]: recv_activation_tensor = recv_activation() [default1]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x56001e0aa2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #28: _PyFunction_Vectorcall + 0x6c (0x56001e0b7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x56001e0a88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #20: + 0x413ea4 (0x7fc82d4d9ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default1]:frame #30: + 0x150582 (0x56001e0c3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x56001e0a88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #32: + 0x150582 (0x56001e0c3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #12: + 0x5838439 (0x7f9ec00aa439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x56001e0a88fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #21: + 0x1445a6 (0x560e7d5065a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #20: + 0x413ea4 (0x7fb13b803ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]: return func(*args, **kwargs) [default1]:frame #34: + 0x150582 (0x56001e0c3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x56001e0a88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56001e0aff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #13: + 0x5843330 (0x7f9ec00b5330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #14: + 0x58433c5 (0x7f9ec00b53c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:frame #37: _PyObject_Call_Prepend + 0x69 (0x56001e0c1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #38: + 0x211239 (0x56001e184239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #39: _PyObject_MakeTpCall + 0x26b (0x56001e0b0a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56001e0ac3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #15: + 0x4e893cc (0x7f9ebf6fb3cc in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:frame #41: _PyFunction_Vectorcall + 0x6c (0x56001e0b7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x56001e0a7c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #43: _PyFunction_Vectorcall + 0x6c (0x56001e0b7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x56001e0a88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #22: _PyObject_MakeTpCall + 0x26b (0x560e7d4ffa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:frame #45: + 0x150582 (0x56001e0c3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #46: PyObject_Call + 0xbc (0x56001e0c3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x56001e0aa2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #21: + 0x1445a6 (0x5628b15255a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:torch.distributed.DistBackendError: [6] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '5:6', but store->get('5:6') got error: Connection reset by peer [default1]:frame #48: + 0x150582 (0x56001e0c3582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #49: PyObject_Call + 0xbc (0x56001e0c3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x56001e0aa2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #51: _PyFunction_Vectorcall + 0x6c (0x56001e0b7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #16: + 0x1a08a88 (0x7f9ebc27aa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56001e0b0007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #53: _PyObject_Call_Prepend + 0x69 (0x56001e0c1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #54: + 0x211239 (0x56001e184239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #55: PyObject_Call + 0x207 (0x56001e0c4067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #23: + 0x150866 (0x560e7d512866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x56001e0aa2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #57: + 0x150582 (0x56001e0c3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default1]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x56001e0a88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #17: + 0x5849a84 (0x7f9ec00bba84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5628b151ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:frame #59: + 0x150582 (0x56001e0c3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #60: PyObject_Call + 0xbc (0x56001e0c3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x56001e0aa2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #62: + 0x150582 (0x56001e0c3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #23: + 0x150866 (0x5628b1531866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fcfb43e9d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:frame #63: PyObject_Call + 0xbc (0x56001e0c3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default7]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x560e7d4fb142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]:Traceback (most recent call last): [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]: trainer.train(dataloader) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default7]:frame #25: _PyFunction_Vectorcall + 0x6c (0x560e7d506a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: dist.recv( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]: pg.recv([tensor], group_src_rank, tag).wait() [default2]: outputs = self.pipeline_engine.train_batch_iter( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]: output = model(**micro_batch) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default2]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]:frame #26: PyObject_Call + 0xbc (0x560e7d512f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default2]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default2]: sharded_logits = self.model( [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return forward_call(*args, **kwargs) [default7]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x560e7d4f92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: dist.recv( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:frame #18: + 0x584ed35 (0x7f9ec00c0d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #1: + 0x589518e (0x7fcfec3a318e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default0]:torch.distributed.DistBackendError: [4] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '3:4', but store->get('3:4') got error: Connection reset by peer [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]:frame #28: _PyFunction_Vectorcall + 0x6c (0x560e7d506a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default2]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]: pipeline_state.run_communication() [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]: recv_activation_tensor = recv_activation() [default5]:frame #19: + 0xc97eee (0x7f9ed2972eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]: return func(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] 
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:frame #20: + 0x413ea4 (0x7f9ed20eeea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x560e7d4f78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5628b151a142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: return func(*args, **kwargs) [default2]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:frame #30: + 0x150582 (0x560e7d512582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]: dist.recv( [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default4]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5628b1525a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fcfec39d9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: return func(*args, **kwargs) [default2]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x560e7d4f78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default0]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default2]:torch.distributed.DistBackendError: [13] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '12:13', but store->get('12:13') got error: Connection reset by peer [default2]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3e708f2d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #21: + 0x1445a6 (0x55ee91fb65a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fcfec39dce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #1: + 0x589518e (0x7f3ea88ac18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f3ea88a69a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #32: + 0x150582 (0x560e7d512582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: 
pg.recv([tensor], group_src_rank, tag).wait() [default2]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f3ea88a6ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f3ea88a7b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3ea885cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55ee91fafa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #26: PyObject_Call + 0xbc (0x5628b1531f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5628b15182b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2d1e448d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3ea885cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3ea885cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3ea885cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x560e7d4f78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: 
pg.recv([tensor], group_src_rank, tag).wait() [default1]:torch.distributed.DistBackendError: [4] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '3:4', but store->get('3:4') got error: Connection reset by peer [default4]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fcfec39eb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f3e71a9ac69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f3e71aa1c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #34: + 0x150582 (0x560e7d512582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #1: + 0x589518e (0x7f2d5640218e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f3e71ac4b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #12: + 0x5838439 (0x7f3ea884f439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #13: + 0x5843330 (0x7f3ea885a330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #14: + 0x58433c5 (0x7f3ea885a3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #35: _PyEval_EvalFrameDefault + 0x13ca 
(0x560e7d4f78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #23: + 0x150866 (0x55ee91fc2866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5628b1525a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x560e7d4fef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:torch.distributed.DistBackendError: [7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '6:7', but store->get('6:7') got error: Connection reset by peer [default2]:frame #15: + 0x4e893cc (0x7f3ea7ea03cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #16: + 0x1a08a88 (0x7f3ea4a1fa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #17: + 0x5849a84 (0x7f3ea8860a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5628b15168fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcfec353f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f2d563fc9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #18: + 0x584ed35 (0x7f3ea8865d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #19: + 0xc97eee (0x7f3ebb117eee in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #20: + 0x413ea4 (0x7f3eba893ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #21: + 0x1445a6 (0x55cf2c2f95a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #37: _PyObject_Call_Prepend + 0x69 (0x560e7d510c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default2]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55cf2c2f2a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #23: + 0x150866 (0x55cf2c305866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55cf2c2ee142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #38: + 0x211239 (0x560e7d5d3239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #30: + 0x150582 (0x5628b1531582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55ee91fab142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default2]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55cf2c2f9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #26: PyObject_Call + 0xbc (0x55cf2c305f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55cf2c2ec2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #28: 
_PyFunction_Vectorcall + 0x6c (0x55cf2c2f9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #39: _PyObject_MakeTpCall + 0x26b (0x560e7d4ffa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f2d563fcce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55cf2c2ea8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #30: + 0x150582 (0x55cf2c305582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55cf2c2ea8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #32: + 0x150582 (0x55cf2c305582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5628b15168fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff0abdaed87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcfec353f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55cf2c2ea8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #34: + 0x150582 (0x55cf2c305582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55cf2c2ea8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55cf2c2f1f50 
in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #32: + 0x150582 (0x5628b1531582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x560e7d4fb3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa57c4dfd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55cf2c303c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #38: + 0x211239 (0x55cf2c3c6239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55cf2c2f2a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55cf2c2ee3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5628b15168fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f2d563fdb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcfec353f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55cf2c2f9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55cf2c2e9c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55cf2c2f9a2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55cf2c2ea8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55ee91fb6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #1: + 0x589518e (0x7fa5b449918e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2d563b2f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #45: + 0x150582 (0x55cf2c305582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #46: PyObject_Call + 0xbc (0x55cf2c305f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55cf2c2ec2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #48: + 0x150582 (0x55cf2c305582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #41: _PyFunction_Vectorcall + 0x6c (0x560e7d506a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcfec353f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #49: PyObject_Call + 0xbc (0x55cf2c305f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55cf2c2ec2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55cf2c2f9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #34: + 
0x150582 (0x5628b1531582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #26: PyObject_Call + 0xbc (0x55ee91fc2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x560e7d4f6c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fa5b44939a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55cf2c2f2007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55cf2c303c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #54: + 0x211239 (0x55cf2c3c6239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #55: PyObject_Call + 0x207 (0x55cf2c306067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5628b15168fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #1: + 0x589518e (0x7ff0e3d6818e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55cf2c2ec2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #57: + 0x150582 (0x55cf2c305582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55cf2c2ea8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #59: + 0x150582 (0x55cf2c305582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #43: _PyFunction_Vectorcall + 
0x6c (0x560e7d506a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fcfb5591c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #60: PyObject_Call + 0xbc (0x55cf2c305f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55cf2c2ec2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #62: + 0x150582 (0x55cf2c305582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #63: PyObject_Call + 0xbc (0x55cf2c305f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5628b151df50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2d563b2f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2d563b2f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default5]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55ee91fa92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fa5b4493ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5628b152fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x560e7d4f78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55ee91fb6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fcfb5598c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #45: + 0x150582 (0x560e7d512582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2d563b2f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55ee91fa78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #30: + 0x150582 (0x55ee91fc2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fa5b4494b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #38: + 0x211239 (0x5628b15f2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration 
>) + 0x360 (0x7ff0e3d629a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fcfb55bbb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #12: + 0x5838439 (0x7fcfec346439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5628b151ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f2d1f5f0c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa5b4449f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55ee91fa78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #13: + 0x5843330 (0x7fcfec351330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #32: + 0x150582 (0x55ee91fc2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa5b4449f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5628b151a3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #14: + 0x58433c5 (0x7fcfec3513c5 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5628b1525a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #46: PyObject_Call + 0xbc (0x560e7d512f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55ee91fa78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7ff0e3d62ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x560e7d4f92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7ff0e3d63b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #34: + 0x150582 (0x55ee91fc2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55ee91fa78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f2d1f5f7c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #48: + 0x150582 (0x560e7d512582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f2d1f61ab60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55ee91faef50 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #12: + 0x5838439 (0x7f2d563a5439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff0e3d18f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55ee91fc0c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #49: PyObject_Call + 0xbc (0x560e7d512f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5628b1515c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff0e3d18f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa5b4449f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #13: + 0x5843330 (0x7f2d563b0330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x560e7d4f92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #38: + 0x211239 (0x55ee92083239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #15: + 0x4e893cc (0x7fcfeb9973cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #16: + 0x1a08a88 (0x7fcfe8516a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame 
#17: + 0x5849a84 (0x7fcfec357a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #51: _PyFunction_Vectorcall + 0x6c (0x560e7d506a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fa5b4449f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55ee91fafa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff0e3d18f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x560e7d4ff007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5628b1525a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #14: + 0x58433c5 (0x7f2d563b03c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #53: _PyObject_Call_Prepend + 0x69 (0x560e7d510c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff0e3d18f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #54: + 0x211239 (0x560e7d5d3239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7ff0acf56c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #55: 
PyObject_Call + 0x207 (0x560e7d513067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5628b15168fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55ee91fab3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #15: + 0x4e893cc (0x7f2d559f63cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #45: + 0x150582 (0x5628b1531582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7ff0acf5dc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55ee91fb6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x560e7d4f92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fa57d687c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #46: PyObject_Call + 0xbc (0x5628b1531f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #18: + 0x584ed35 (0x7fcfec35cd35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7ff0acf80b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #42: 
_PyEval_EvalFrameDefault + 0x72c (0x55ee91fa6c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fa57d68ec5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5628b15182b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #16: + 0x1a08a88 (0x7f2d52575a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55ee91fb6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #19: + 0xc97eee (0x7fcffec0eeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #20: + 0x413ea4 (0x7fcffe38aea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #48: + 0x150582 (0x5628b1531582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fa57d6b1b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55ee91fa78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #12: + 0x5838439 (0x7ff0e3d0b439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #49: PyObject_Call + 0xbc (0x5628b1531f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5628b15182b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #13: + 0x5843330 (0x7ff0e3d16330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #21: + 0x1445a6 (0x562d0aa475a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #12: + 0x5838439 (0x7fa5b443c439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #57: + 0x150582 (0x560e7d512582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #14: + 0x58433c5 (0x7ff0e3d163c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #45: + 0x150582 (0x55ee91fc2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #46: PyObject_Call + 0xbc (0x55ee91fc2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5628b1525a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #13: + 0x5843330 (0x7fa5b4447330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #14: + 0x58433c5 (0x7fa5b44473c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x560e7d4f78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #22: _PyObject_MakeTpCall + 0x26b (0x562d0aa40a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5628b151e007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #17: + 0x5849a84 (0x7f2d563b6a84 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55ee91fa92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #15: + 0x4e893cc (0x7fa5b3a8d3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5628b152fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #54: + 0x211239 (0x5628b15f2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #23: + 0x150866 (0x562d0aa53866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #59: + 0x150582 (0x560e7d512582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #15: + 0x4e893cc (0x7ff0e335c3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #16: + 0x1a08a88 (0x7fa5b060ca88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #48: + 0x150582 (0x55ee91fc2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #55: PyObject_Call + 0x207 (0x5628b1532067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x562d0aa3c142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #49: PyObject_Call + 0xbc (0x55ee91fc2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #60: PyObject_Call + 0xbc (0x560e7d512f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5628b15182b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #16: + 0x1a08a88 (0x7ff0dfedba88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x560e7d4f92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #17: + 0x5849a84 (0x7fa5b444da84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #57: + 0x150582 (0x5628b1531582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #25: _PyFunction_Vectorcall + 0x6c (0x562d0aa47a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #26: PyObject_Call + 0xbc (0x562d0aa53f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #62: + 0x150582 (0x560e7d512582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #17: + 0x5849a84 (0x7ff0e3d1ca84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #63: PyObject_Call + 0xbc (0x560e7d512f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x562d0aa3a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55ee91fa92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55ee91fb6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #28: _PyFunction_Vectorcall + 0x6c (0x562d0aa47a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5628b15168fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #18: + 0x584ed35 (0x7fa5b4452d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #59: + 0x150582 (0x5628b1531582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #19: + 0xc97eee (0x7fa5c6d04eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55ee91faf007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #18: + 0x584ed35 (0x7f2d563bbd35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #60: PyObject_Call + 0xbc (0x5628b1531f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x562d0aa388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5628b15182b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55ee91fc0c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #54: + 0x211239 (0x55ee92083239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #18: + 0x584ed35 (0x7ff0e3d21d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #19: + 0xc97eee (0x7ff0f65d3eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #55: PyObject_Call + 0x207 (0x55ee91fc3067 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #20: + 0x413ea4 (0x7ff0f5d4fea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #30: + 0x150582 (0x562d0aa53582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55ee91fa92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #62: + 0x150582 (0x5628b1531582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #20: + 0x413ea4 (0x7fa5c6480ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #57: + 0x150582 (0x55ee91fc2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #19: + 0xc97eee (0x7f2d68c6deee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x562d0aa388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55ee91fa78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #21: + 0x1445a6 (0x55db7f6605a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #59: + 0x150582 (0x55ee91fc2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #60: PyObject_Call + 0xbc (0x55ee91fc2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #20: + 0x413ea4 (0x7f2d683e9ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #63: PyObject_Call + 0xbc (0x5628b1531f1c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #32: + 0x150582 (0x562d0aa53582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55ee91fa92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #21: + 0x1445a6 (0x55e3c9e605a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55db7f659a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #62: + 0x150582 (0x55ee91fc2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #63: PyObject_Call + 0xbc (0x55ee91fc2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default4]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x562d0aa388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55e3c9e59a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:Traceback (most recent call last): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]: trainer.train(dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]: outputs = self.pipeline_engine.train_batch_iter( [default6]:frame #23: + 0x150866 (0x55db7f66c866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55db7f655142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:frame #23: + 0x150866 (0x55e3c9e6c866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: output = model(**micro_batch) [default6]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55db7f660a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return 
self._call_impl(*args, **kwargs) [default1]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55e3c9e55142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #34: + 0x150582 (0x562d0aa53582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]:frame #26: PyObject_Call + 0xbc (0x55db7f66cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return forward_call(*args, **kwargs) [default0]:frame #21: + 0x1445a6 (0x558ad329e5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x562d0aa388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55e3c9e60a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: sharded_logits = self.model( [default2]:Traceback (most recent call last): [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x562d0aa3ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: trainer.train(dataloader) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in 
_call_impl [default6]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55db7f6532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #26: PyObject_Call + 0xbc (0x55e3c9e6cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:frame #37: _PyObject_Call_Prepend + 0x69 (0x562d0aa51c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:frame #22: _PyObject_MakeTpCall + 0x26b (0x558ad3297a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #23: + 0x150866 (0x558ad32aa866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]:frame #38: + 0x211239 (0x562d0ab14239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55db7f660a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55e3c9e532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #39: _PyObject_MakeTpCall + 0x26b (0x562d0aa40a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default4]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x562d0aa3c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55db7f6518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default4]:frame #41: _PyFunction_Vectorcall + 0x6c (0x562d0aa47a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55e3c9e60a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:frame #30: + 0x150582 (0x55db7f66c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: pipeline_state.run_communication() [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x558ad3293142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x562d0aa37c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: recv_activation_tensor = 
recv_activation() [default6]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55db7f6518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55e3c9e518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #43: _PyFunction_Vectorcall + 0x6c (0x562d0aa47a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:frame #32: + 0x150582 (0x55db7f66c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:frame #30: + 0x150582 (0x55e3c9e6c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x562d0aa388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:frame #25: _PyFunction_Vectorcall + 0x6c (0x558ad329ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55db7f6518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #31: _PyEval_EvalFrameDefault + 
0x13ca (0x55e3c9e518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: dist.recv( [default0]:frame #26: PyObject_Call + 0xbc (0x558ad32aaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: outputs = self.pipeline_engine.train_batch_iter( [default4]:frame #45: + 0x150582 (0x562d0aa53582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #32: + 0x150582 (0x55e3c9e6c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default0]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x558ad32912b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default1]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55e3c9e518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #46: PyObject_Call + 0xbc (0x562d0aa53f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:frame #34: + 0x150582 (0x55db7f66c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]: output = model(**micro_batch) [default0]:frame #28: _PyFunction_Vectorcall + 0x6c (0x558ad329ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return func(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", 
line 1706, in recv [default4]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x562d0aa3a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:frame #34: + 0x150582 (0x55e3c9e6c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default6]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default4]:frame #48: + 0x150582 (0x562d0aa53582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default0]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x558ad328f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return forward_call(*args, **kwargs) [default1]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55e3c9e518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #49: PyObject_Call + 0xbc (0x562d0aa53f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #30: + 0x150582 (0x558ad32aa582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4a4bb03d87 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0x589518e (0x7f4a83abd18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55db7f6518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: sharded_logits = self.model( [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55e3c9e58f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f4a83ab79a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55db7f658f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x558ad328f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f4a83ab7ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: return self._call_impl(*args, **kwargs) [default1]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55e3c9e6ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x562d0aa3a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #4: c10d::TCPStore::get(std::string 
const&) + 0xa1 (0x7f4a83ab8b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4a83a6df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4a83a6df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #32: + 0x150582 (0x558ad32aa582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: return forward_call(*args, **kwargs) [default4]:frame #51: _PyFunction_Vectorcall + 0x6c (0x562d0aa47a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55db7f66ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default1]:frame #38: + 0x211239 (0x55e3c9f2d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x562d0aa40007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4a83a6df81 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #38: + 0x211239 (0x55db7f72d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4a83a6df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x558ad328f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f4a4ccabc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #53: _PyObject_Call_Prepend + 0x69 (0x562d0aa51c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f4a4ccb2c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]:frame #34: + 0x150582 (0x558ad32aa582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55e3c9e59a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: return forward_call(*args, **kwargs) [default4]:frame #54: + 0x211239 (0x562d0ab14239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:frame #39: _PyObject_MakeTpCall + 0x26b 
(0x55db7f659a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55e3c9e553e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:frame #55: PyObject_Call + 0x207 (0x562d0aa54067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x562d0aa3a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f4a4ccd5b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55db7f6553e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #12: + 0x5838439 (0x7f4a83a60439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55e3c9e60a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #13: + 0x5843330 (0x7f4a83a6b330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #57: + 0x150582 (0x562d0aa53582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #14: + 0x58433c5 (0x7f4a83a6b3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #15: + 0x4e893cc (0x7f4a830b13cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x558ad328f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55e3c9e50c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: pipeline_state.run_communication() [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55e3c9e60a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55db7f660a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #16: + 0x1a08a88 (0x7f4a7fc30a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #17: + 0x5849a84 (0x7f4a83a71a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x558ad3296f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: recv_activation_tensor = recv_activation() [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x562d0aa388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55db7f650c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #18: + 0x584ed35 (0x7f4a83a76d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #37: 
_PyObject_Call_Prepend + 0x69 (0x558ad32a8c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #19: + 0xc97eee (0x7f4a96328eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55e3c9e518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #59: + 0x150582 (0x562d0aa53582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:frame #38: + 0x211239 (0x558ad336b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #20: + 0x413ea4 (0x7f4a95aa4ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #21: + 0x1445a6 (0x5570842b25a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55db7f660a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:frame #39: _PyObject_MakeTpCall + 0x26b (0x558ad3297a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5570842aba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #23: + 0x150866 (0x5570842be866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5570842a7142 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55db7f6518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #60: PyObject_Call + 0xbc (0x562d0aa53f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x562d0aa3a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5570842b2a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #45: + 0x150582 (0x55e3c9e6c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #26: PyObject_Call + 0xbc (0x5570842bef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #45: + 0x150582 (0x55db7f66c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #62: + 0x150582 (0x562d0aa53582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x558ad32933e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:frame #46: PyObject_Call + 0xbc (0x55e3c9e6cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55e3c9e532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: dist.recv( [default4]:frame #63: PyObject_Call + 0xbc (0x562d0aa53f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]:frame #41: _PyFunction_Vectorcall + 0x6c (0x558ad329ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5570842a52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5570842b2a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x558ad328ec5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #48: + 0x150582 (0x55e3c9e6c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #49: PyObject_Call + 0xbc (0x55e3c9e6cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: return func(*args, **kwargs) [default0]:frame #43: _PyFunction_Vectorcall + 0x6c (0x558ad329ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default6]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5570842a38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #30: + 0x150582 (0x5570842be582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default1]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55e3c9e532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5570842a38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #32: + 0x150582 (0x5570842be582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x558ad328f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5570842a38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #34: + 0x150582 (0x5570842be582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5570842a38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #45: + 0x150582 (0x558ad32aa582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5570842aaf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55e3c9e60a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5570842bcc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55e3c9e59007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #38: + 0x211239 (0x55708437f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5570842aba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5570842a73e6 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #46: PyObject_Call + 0xbc (0x55db7f66cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:frame #46: PyObject_Call + 0xbc (0x558ad32aaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55e3c9e6ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5570842b2a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55db7f6532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x558ad32912b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5570842a2c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #54: + 0x211239 (0x55e3c9f2d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #55: PyObject_Call + 0x207 (0x55e3c9e6d067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5570842b2a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5570842a38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #48: + 0x150582 (0x558ad32aa582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #49: PyObject_Call + 0xbc (0x558ad32aaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #45: + 0x150582 (0x5570842be582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default6]:frame #46: PyObject_Call + 0xbc (0x5570842bef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55e3c9e532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #57: + 0x150582 (0x55e3c9e6c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5570842a52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #48: + 0x150582 (0x5570842be582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55e3c9e518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #49: PyObject_Call + 0xbc (0x5570842bef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5570842a52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #48: + 0x150582 (0x55db7f66c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5570842b2a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #59: + 0x150582 (0x55e3c9e6c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #60: PyObject_Call + 0xbc (0x55e3c9e6cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5570842ab007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5570842bcc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x558ad32912b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #51: _PyFunction_Vectorcall + 0x6c (0x558ad329ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #54: + 0x211239 (0x55708437f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default6]:frame #55: PyObject_Call + 0x207 (0x5570842bf067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5570842a52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55e3c9e532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #49: PyObject_Call + 0xbc (0x55db7f66cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x558ad3297007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3070c72d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #53: _PyObject_Call_Prepend + 0x69 (0x558ad32a8c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55db7f6532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #62: + 0x150582 (0x55e3c9e6c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #1: + 0x589518e 
(0x7f30a8c2c18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #54: + 0x211239 (0x558ad336b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55db7f660a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #57: + 0x150582 (0x5570842be582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #55: PyObject_Call + 0x207 (0x558ad32ab067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5570842a38fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #59: + 0x150582 (0x5570842be582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55db7f659007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #60: PyObject_Call + 0xbc (0x5570842bef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5570842a52b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #63: PyObject_Call + 0xbc (0x55e3c9e6cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #62: + 0x150582 (0x5570842be582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #63: PyObject_Call + 0xbc (0x5570842bef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x558ad32912b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55db7f66ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame 
#2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f30a8c269a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f30a8c26ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default2]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f30a8c27b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #54: + 0x211239 (0x55db7f72d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f30a8bdcf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f30a8bdcf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #57: + 0x150582 (0x558ad32aa582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f30a8bdcf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f30a8bdcf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x558ad328f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #55: PyObject_Call + 0x207 (0x55db7f66d067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default2]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f3071e1ac69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #59: + 0x150582 (0x558ad32aa582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f3071e21c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55db7f6532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f3071e44b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #57: + 0x150582 (0x55db7f66c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #60: PyObject_Call + 0xbc (0x558ad32aaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #12: + 0x5838439 (0x7f30a8bcf439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55db7f6518fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #59: + 0x150582 (0x55db7f66c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #13: + 0x5843330 (0x7f30a8bda330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #60: PyObject_Call + 0xbc (0x55db7f66cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #61: _PyEval_EvalFrameDefault + 
0x2d83 (0x558ad32912b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #14: + 0x58433c5 (0x7f30a8bda3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #15: + 0x4e893cc (0x7f30a82203cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #62: + 0x150582 (0x558ad32aa582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #63: PyObject_Call + 0xbc (0x558ad32aaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #16: + 0x1a08a88 (0x7f30a4d9fa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #17: + 0x5849a84 (0x7f30a8be0a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default6]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55db7f6532b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #18: + 0x584ed35 (0x7f30a8be5d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #19: + 0xc97eee (0x7f30bb497eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #20: + 0x413ea4 (0x7f30bac13ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #62: + 0x150582 (0x55db7f66c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #21: + 0x1445a6 (0x55de83c085a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55de83c01a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #23: + 0x150866 (0x55de83c14866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55de83bfd142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #63: PyObject_Call + 0xbc (0x55db7f66cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default2]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55de83c08a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:Traceback (most recent call last): [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]: trainer.train(dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]:Traceback (most recent call last): [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:frame #26: PyObject_Call + 0xbc (0x55de83c14f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default7]: outputs = self.pipeline_engine.train_batch_iter( [default2]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55de83bfb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55de83c08a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: trainer.train(dataloader) [default3]:Traceback (most recent call last): [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55de83bf98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:frame #30: + 0x150582 (0x55de83c14582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55de83bf98fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #32: + 0x150582 (0x55de83c14582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55de83bf98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default3]: trainer.train(dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]:frame #34: + 0x150582 (0x55de83c14582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55de83bf98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55de83c00f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]: outputs = self.pipeline_engine.train_batch_iter( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default2]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55de83c12c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]: outputs = self.pipeline_engine.train_batch_iter( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default2]:frame #38: + 0x211239 (0x55de83cd5239 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55de83c01a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: output = model(**micro_batch) [default2]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55de83bfd3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55de83c08a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55de83bf8c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55de83c08a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]: output = model(**micro_batch) [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55de83bf98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #45: + 0x150582 (0x55de83c14582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #46: PyObject_Call + 0xbc (0x55de83c14f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55de83bfb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]:frame #48: + 0x150582 (0x55de83c14582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #49: PyObject_Call + 0xbc (0x55de83c14f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55de83bfb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return forward_call(*args, **kwargs) [default3]: output = model(**micro_batch) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default7]: sharded_logits = self.model( [default2]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55de83c08a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]: sharded_logits = self.model( [default2]:frame #52: 
_PyObject_FastCallDictTstate + 0x187 (0x55de83c01007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55de83c12c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #54: + 0x211239 (0x55de83cd5239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default2]:frame #55: PyObject_Call + 0x207 (0x55de83c15067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55de83bfb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:frame #57: + 0x150582 (0x55de83c14582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55de83bf98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:frame #59: + 0x150582 (0x55de83c14582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #60: PyObject_Call + 0xbc (0x55de83c14f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55de83bfb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: sharded_logits = self.model( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]: return self._call_impl(*args, **kwargs) [default3]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: return forward_call(*args, **kwargs) [default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]: pipeline_state.run_communication() [default3]: pipeline_state.run_communication() [default2]:frame #62: + 0x150582 (0x55de83c14582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: return 
self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:frame #63: PyObject_Call + 0xbc (0x55de83c14f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]: recv_activation_tensor = recv_activation() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]: recv_activation_tensor = recv_activation() [default3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: dist.recv( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return func(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default3]: pg.recv([tensor], group_src_rank, tag).wait() [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f53ec2ded87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0x589518e (0x7f542429818e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f54242929a0 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f5424292ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f5424293b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5424248f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5424248f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5424248f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5424248f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f53ed486c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f53ed48dc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f53ed4b0b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #12: + 0x5838439 (0x7f542423b439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:frame #13: + 0x5843330 (0x7f5424246330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #14: + 0x58433c5 (0x7f54242463c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #15: + 0x4e893cc (0x7f542388c3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #16: + 0x1a08a88 (0x7f542040ba88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #17: + 0x5849a84 (0x7f542424ca84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: pipeline_state.run_communication() [default3]:frame #18: + 0x584ed35 (0x7f5424251d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #19: + 0xc97eee (0x7f5436b03eee in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #20: + 0x413ea4 (0x7f543627fea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:frame #21: + 0x1445a6 (0x55d3cb3465a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55d3cb33fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #23: + 0x150866 (0x55d3cb352866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55d3cb33b142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55d3cb346a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #26: PyObject_Call + 0xbc (0x55d3cb352f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55d3cb3392b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: recv_activation_tensor = recv_activation() [default7]: dist.recv( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default3]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55d3cb346a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55d3cb3378fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #30: + 0x150582 (0x55d3cb352582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55d3cb3378fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return func(*args, **kwargs) [default3]:frame #32: + 0x150582 (0x55d3cb352582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55d3cb3378fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #34: + 0x150582 (0x55d3cb352582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55d3cb3378fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55d3cb33ef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default7]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55d3cb350c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #38: + 0x211239 (0x55d3cb413239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55d3cb33fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55d3cb33b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55d3cb346a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55d3cb336c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55d3cb346a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55d3cb3378fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #45: + 0x150582 (0x55d3cb352582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:frame #46: PyObject_Call + 0xbc (0x55d3cb352f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55d3cb3392b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #48: + 0x150582 (0x55d3cb352582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #49: PyObject_Call + 0xbc (0x55d3cb352f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55d3cb3392b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55d3cb346a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55d3cb33f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #53: 
_PyObject_Call_Prepend + 0x69 (0x55d3cb350c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #54: + 0x211239 (0x55d3cb413239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]:frame #55: PyObject_Call + 0x207 (0x55d3cb353067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55d3cb3392b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #57: + 0x150582 (0x55d3cb352582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55d3cb3378fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #59: + 0x150582 (0x55d3cb352582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:frame #60: PyObject_Call + 0xbc (0x55d3cb352f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55d3cb3392b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #62: + 0x150582 (0x55d3cb352582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #63: PyObject_Call + 0xbc (0x55d3cb352f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: dist.recv( [default3]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default3]: return func(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default7]:torch.distributed.DistBackendError: [7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '6:7', but store->get('6:7') got error: Connection reset by peer [default7]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default3]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:torch.distributed.DistBackendError: [5] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '4:5', but store->get('4:5') got error: Connection reset by peer [default3]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6372f72d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0x589518e (0x7f63aaf2c18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd0bf084d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0x589518e (0x7fd0f703e18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f63aaf269a0 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fd0f70389a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f63aaf26ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fd0f7038ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f63aaf27b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fd0f7039b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd0f6feef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd0f6feef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f63aaedcf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd0f6feef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd0f6feef81 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fd0c022cc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fd0c0233c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fd0c0256b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #12: + 0x5838439 (0x7fd0f6fe1439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f63aaedcf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f63aaedcf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #13: + 0x5843330 (0x7fd0f6fec330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #14: + 0x58433c5 (0x7fd0f6fec3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #15: + 0x4e893cc (0x7fd0f66323cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #16: + 0x1a08a88 (0x7fd0f31b1a88 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f63aaedcf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #17: + 0x5849a84 (0x7fd0f6ff2a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f637411ac69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f6374121c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f6374144b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #12: + 0x5838439 (0x7f63aaecf439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #13: + 0x5843330 (0x7f63aaeda330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #14: + 0x58433c5 (0x7f63aaeda3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #15: + 0x4e893cc (0x7f63aa5203cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #16: + 0x1a08a88 (0x7f63a709fa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default7]:frame #18: + 0x584ed35 (0x7fd0f6ff7d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #17: + 0x5849a84 (0x7f63aaee0a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #19: + 0xc97eee (0x7fd1098a9eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #20: + 0x413ea4 (0x7fd109025ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #18: + 0x584ed35 (0x7f63aaee5d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #21: + 0x1445a6 (0x55f4a8f9c5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55f4a8f95a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #19: + 0xc97eee (0x7f63bd797eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #23: + 0x150866 (0x55f4a8fa8866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #20: + 0x413ea4 (0x7f63bcf13ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55f4a8f91142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55f4a8f9ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #26: PyObject_Call + 0xbc (0x55f4a8fa8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #21: + 0x1445a6 (0x564b264ec5a6 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #22: _PyObject_MakeTpCall + 0x26b (0x564b264e5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #23: + 0x150866 (0x564b264f8866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55f4a8f8f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55f4a8f9ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x564b264e1142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #25: _PyFunction_Vectorcall + 0x6c (0x564b264eca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #26: PyObject_Call + 0xbc (0x564b264f8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x564b264df2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #28: _PyFunction_Vectorcall + 0x6c (0x564b264eca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55f4a8f8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x564b264dd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #30: + 0x150582 (0x564b264f8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #30: + 0x150582 (0x55f4a8fa8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x564b264dd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #32: + 
0x150582 (0x564b264f8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x564b264dd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #34: + 0x150582 (0x564b264f8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x564b264dd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55f4a8f8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #32: + 0x150582 (0x55f4a8fa8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x564b264e4f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #37: _PyObject_Call_Prepend + 0x69 (0x564b264f6c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #38: + 0x211239 (0x564b265b9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55f4a8f8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #39: _PyObject_MakeTpCall + 0x26b (0x564b264e5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x564b264e13e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #41: _PyFunction_Vectorcall + 0x6c (0x564b264eca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #34: + 0x150582 (0x55f4a8fa8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x564b264dcc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default3]:frame #43: _PyFunction_Vectorcall + 0x6c (0x564b264eca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x564b264dd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #45: + 0x150582 (0x564b264f8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #46: PyObject_Call + 0xbc (0x564b264f8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x564b264df2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #48: + 0x150582 (0x564b264f8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #49: PyObject_Call + 0xbc (0x564b264f8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x564b264df2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #51: _PyFunction_Vectorcall + 0x6c (0x564b264eca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x564b264e5007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #53: _PyObject_Call_Prepend + 0x69 (0x564b264f6c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #54: + 0x211239 (0x564b265b9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #55: PyObject_Call + 0x207 (0x564b264f9067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x564b264df2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #57: + 0x150582 (0x564b264f8582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x564b264dd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #59: + 0x150582 (0x564b264f8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #60: PyObject_Call + 0xbc (0x564b264f8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x564b264df2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #62: + 0x150582 (0x564b264f8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55f4a8f8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #63: PyObject_Call + 0xbc (0x564b264f8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55f4a8f94f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55f4a8fa6c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #38: + 0x211239 (0x55f4a9069239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55f4a8f95a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55f4a8f913e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default7]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55f4a8f9ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55f4a8f8cc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55f4a8f9ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55f4a8f8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #45: + 0x150582 (0x55f4a8fa8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #46: PyObject_Call + 0xbc (0x55f4a8fa8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55f4a8f8f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #48: + 0x150582 (0x55f4a8fa8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #49: PyObject_Call + 0xbc (0x55f4a8fa8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55f4a8f8f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55f4a8f9ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55f4a8f95007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55f4a8fa6c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #54: + 0x211239 (0x55f4a9069239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #55: PyObject_Call + 0x207 (0x55f4a8fa9067 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55f4a8f8f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #57: + 0x150582 (0x55f4a8fa8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55f4a8f8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #59: + 0x150582 (0x55f4a8fa8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #60: PyObject_Call + 0xbc (0x55f4a8fa8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55f4a8f8f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #62: + 0x150582 (0x55f4a8fa8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #63: PyObject_Call + 0xbc (0x55f4a8fa8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default5]:Traceback (most recent call last):
[default5]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default5]:    trainer.train(dataloader)
[default5]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train
[default5]:    outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default5]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step
[default5]:    outputs = self.pipeline_engine.train_batch_iter(
[default5]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter
[default5]:    output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default5]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default5]:    output = model(**micro_batch)
[default5]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]:    return self._call_impl(*args, **kwargs)
[default5]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]:    return forward_call(*args, **kwargs)
[default5]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward
[default5]:    sharded_logits = self.model(
[default5]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]:    return self._call_impl(*args, **kwargs)
[default5]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]:    return forward_call(*args, **kwargs)
[default5]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default5]:    return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default5]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default5]:    hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default5]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]:    return self._call_impl(*args, **kwargs)
[default5]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]:    return forward_call(*args, **kwargs)
[default5]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default5]:    new_kwargs[name] = recv_from_pipeline_state_buffer(
[default5]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default5]:    pipeline_state.run_communication()
[default5]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default5]:    recv_activation_tensor = recv_activation()
[default5]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default5]:    return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default5]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default5]:    buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default5]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default5]:    meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default5]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta
[default5]:    dist.recv(
[default5]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
[default5]:    return func(*args, **kwargs)
[default5]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv
[default5]:    pg.recv([tensor], group_src_rank, tag).wait()
[default5]:torch.distributed.DistBackendError: [6] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '5:6', but store->get('5:6') got error: Connection reset by peer
[default5]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first):
[default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4711fdcd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default5]:frame #1: + 0x589518e (0x7f4749f9618e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f4749f909a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f4749f90ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f4749f91b11 in
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4749f46f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4749f46f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4749f46f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4749f46f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f4713184c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f471318bc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f47131aeb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #12: + 0x5838439 (0x7f4749f39439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #13: + 0x5843330 (0x7f4749f44330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #14: + 0x58433c5 (0x7f4749f443c5 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #15: + 0x4e893cc (0x7f474958a3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #16: + 0x1a08a88 (0x7f4746109a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #17: + 0x5849a84 (0x7f4749f4aa84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #18: + 0x584ed35 (0x7f4749f4fd35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #19: + 0xc97eee (0x7f475c801eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #20: + 0x413ea4 (0x7f475bf7dea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #21: + 0x1445a6 (0x55b7814c75a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55b7814c0a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #23: + 0x150866 (0x55b7814d3866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b7814bc142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55b7814c7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #26: PyObject_Call + 0xbc (0x55b7814d3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b7814ba2b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55b7814c7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b7814b88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #30: + 0x150582 (0x55b7814d3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b7814b88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #32: + 0x150582 (0x55b7814d3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b7814b88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #34: + 0x150582 (0x55b7814d3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b7814b88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b7814bff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55b7814d1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #38: + 0x211239 (0x55b781594239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55b7814c0a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b7814bc3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55b7814c7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #42: 
_PyEval_EvalFrameDefault + 0x72c (0x55b7814b7c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55b7814c7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b7814b88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #45: + 0x150582 (0x55b7814d3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #46: PyObject_Call + 0xbc (0x55b7814d3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b7814ba2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #48: + 0x150582 (0x55b7814d3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #49: PyObject_Call + 0xbc (0x55b7814d3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b7814ba2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55b7814c7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b7814c0007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55b7814d1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #54: + 0x211239 (0x55b781594239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #55: PyObject_Call + 0x207 (0x55b7814d4067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b7814ba2b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #57: + 0x150582 (0x55b7814d3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b7814b88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #59: + 0x150582 (0x55b7814d3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #60: PyObject_Call + 0xbc (0x55b7814d3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b7814ba2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #62: + 0x150582 (0x55b7814d3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #63: PyObject_Call + 0xbc (0x55b7814d3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[2024-07-06 09:33:12,629] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1017333 closing signal SIGTERM
[2024-07-06 09:33:12,761] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1408970) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
[2024-07-06 09:33:12,776] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 959055) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-164-18.ec2.internal
  rank      : 9 (local_rank: 1)
  exitcode  : 1 (pid: 959056)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-164-18.ec2.internal
  rank      : 10 (local_rank: 2)
  exitcode  : 1 (pid: 959057)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-164-18.ec2.internal
  rank      : 11 (local_rank: 3)
  exitcode  : 1 (pid: 959058)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-164-18.ec2.internal
  rank      : 12 (local_rank: 4)
  exitcode  : 1 (pid: 959059)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-164-18.ec2.internal
  rank      : 13 (local_rank: 5)
  exitcode  : 1 (pid: 959060)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-164-18.ec2.internal
  rank      : 14 (local_rank: 6)
  exitcode  : 1 (pid: 959061)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-164-18.ec2.internal
  rank      : 15 (local_rank: 7)
  exitcode  : 1 (pid: 959062)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-164-18.ec2.internal
  rank      : 8 (local_rank: 0)
  exitcode  : 1 (pid: 959055)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-164-45.ec2.internal
  rank      : 25 (local_rank: 1)
  exitcode  : 1 (pid: 1408972)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-164-45.ec2.internal
  rank      : 26 (local_rank: 2)
  exitcode  : 1 (pid: 1408973)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-164-45.ec2.internal
  rank      : 27 (local_rank: 3)
  exitcode  : 1 (pid: 1408974)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-164-45.ec2.internal
  rank      : 28 (local_rank: 4)
  exitcode  : 1 (pid: 1408975)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-164-45.ec2.internal
  rank      : 29 (local_rank: 5)
  exitcode  : 1 (pid: 1408976)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-164-45.ec2.internal
  rank      : 30 (local_rank: 6)
  exitcode  : 1 (pid: 1408977)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-164-45.ec2.internal
  rank      : 31 (local_rank: 7)
  exitcode  : 1 (pid: 1408978)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-164-45.ec2.internal
  rank      : 24 (local_rank: 0)
  exitcode  : 1 (pid: 1408970)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: ip-26-0-164-45: task 2: Exited with exit code 1
srun: error: ip-26-0-164-18: task 1: Exited with exit code 1
[2024-07-06 09:33:13,669] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 1017332) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-163-236.ec2.internal
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1017334)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-163-236.ec2.internal
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1017335)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-163-236.ec2.internal
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 1017336)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-163-236.ec2.internal
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 1017337)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-163-236.ec2.internal
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 1017338)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-163-236.ec2.internal
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 1017339)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-06_09:33:12
  host      : ip-26-0-163-236.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 1017332)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1017332
============================================================
srun: error: ip-26-0-163-236: task 0: Exited with exit code 1
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.