3outeille HF staff
Upload llama-1B/8_GPUS/dp-4_tp-1_pp-2_mbz-16
8a7f4ec verified
========================
START TIME: Wed Jul 3 22:47:58 UTC 2024
python3 version = Python 3.10.14
========================
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token
Login successful
Already on 'bench_cluster'
M examples/config_tiny_llama.py
M examples/config_tiny_llama.yaml
M examples/train_tiny_llama.sh
M src/nanotron/models/llama.py
M src/nanotron/trainer.py
Your branch is up to date with 'origin/bench_cluster'.
Job status: RUNNING
W0703 22:48:00.819000 140318128015168 torch/distributed/run.py:757]
W0703 22:48:00.819000 140318128015168 torch/distributed/run.py:757] *****************************************
W0703 22:48:00.819000 140318128015168 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 22:48:00.819000 140318128015168 torch/distributed/run.py:757] *****************************************
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Config:
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Config(general=GeneralArgs(project='bench_cluster',
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: run='%date_%jobid',
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: seed=42,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: step=None,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: consumed_train_samples=None,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: benchmark_csv_path=None,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: ignore_sanity_checks=True),
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: parallelism=ParallelismArgs(dp=4,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pp=2,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tp=1,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pp_engine=<nanotron.parallel.pipeline_parallel.engine.OneForwardOneBackwardPipelineEngine object at 0x7fdb966fc730>,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tp_mode=<TensorParallelLinearMode.REDUCE_SCATTER: 2>,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tp_linear_async_communication=False,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: expert_parallel_size=1),
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: eos_token_id=2,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hidden_act='silu',
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hidden_size=2048,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: initializer_range=0.02,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: intermediate_size=4096,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: is_llama_config=True,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: max_position_embeddings=4096,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_attention_heads=32,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_hidden_layers=24,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_key_value_heads=32,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pad_token_id=None,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pretraining_tp=1,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rms_norm_eps=1e-05,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rope_scaling=None,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rope_theta=10000.0,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tie_word_embeddings=True,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: use_cache=True,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: vocab_size=50257),
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: init_method=RandomInit(std=0.025),
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: dtype=torch.bfloat16,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: make_vocab_size_divisible_by=1,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: ddp_bucket_cap_mb=25),
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2',
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tokenizer_revision=None,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tokenizer_max_length=None),
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'),
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: checkpoint_interval=100000,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: save_initial_state=False,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: resume_checkpoint_path=None,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: checkpoints_path_is_shared_file_system=False),
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: logging=LoggingArgs(log_level='info',
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: log_level_replica='info',
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: iteration_step_info_interval=1),
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tokens=TokensArgs(sequence_length=4096,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: train_steps=20,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: micro_batch_size=16,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: batch_accumulation_per_replica=16,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: val_check_interval=-1,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: limit_val_batches=0,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: limit_test_batches=0),
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: adam_beta1=0.9,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: adam_beta2=0.95,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: torch_adam_is_fused=True,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: name='adamW'),
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: zero_stage=1,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: weight_decay=0.01,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: clip_grad=1.0,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: accumulate_grad_in_fp32=True,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lr_warmup_steps=1,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lr_warmup_style='linear',
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lr_decay_style='linear',
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lr_decay_steps=19,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lr_decay_starting_step=None,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: min_decay_lr=1e-05)),
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: data_stages=[DatasetStageArgs(name='Training Stage',
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: start_training_step=1,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories',
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hf_dataset_splits='train',
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hf_dataset_config_name=None,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: dataset_processing_num_proc_per_process=64,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: dataset_overwrite_cache=False,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: text_column_name='text'),
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: seed=42,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_loading_workers=0))],
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/8_GPUS/dp-4_tp-1_pp-2_mbz-16')),
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lighteval=None)
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Model Config:
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: LlamaConfig(bos_token_id=1,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: eos_token_id=2,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hidden_act='silu',
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hidden_size=2048,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: initializer_range=0.02,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: intermediate_size=4096,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: is_llama_config=True,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: max_position_embeddings=4096,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_attention_heads=32,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_hidden_layers=24,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_key_value_heads=32,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pad_token_id=None,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pretraining_tp=1,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rms_norm_eps=1e-05,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rope_scaling=None,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rope_theta=10000.0,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tie_word_embeddings=True,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: use_cache=True,
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: vocab_size=50257)
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Building model..
[default0]:07/03/2024 22:48:17 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Setting PP block ranks...
[default5]:07/03/2024 22:48:28 [INFO|DP=1|PP=1|TP=0|ip-26-0-161-178]: No checkpoint path provided.
[default2]:07/03/2024 22:48:28 [INFO|DP=2|PP=0|TP=0|ip-26-0-161-178]: No checkpoint path provided.
[default6]:07/03/2024 22:48:28 [INFO|DP=2|PP=1|TP=0|ip-26-0-161-178]: No checkpoint path provided.
[default0]:07/03/2024 22:48:28 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Total number of parameters: 1.21G (2312.82MiB)
[default0]:07/03/2024 22:48:28 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Local number of parameters: 690M (1316.43MiB)
[default0]:07/03/2024 22:48:28 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [After model building] Memory usage: 1330.44MiB. Peak allocated: 1332.47MiB Peak reserved: 1364.00MiB
[default3]:07/03/2024 22:48:28 [INFO|DP=3|PP=0|TP=0|ip-26-0-161-178]: No checkpoint path provided.
[default0]:07/03/2024 22:48:28 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: No checkpoint path provided.
[default0]:07/03/2024 22:48:28 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Parametrizing model parameters using StandardParametrizator
[default1]:07/03/2024 22:48:28 [INFO|DP=1|PP=0|TP=0|ip-26-0-161-178]: No checkpoint path provided.
[default4]:07/03/2024 22:48:28 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-178]: Local number of parameters: 522M (996.40MiB)
[default4]:07/03/2024 22:48:28 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-178]: [After model building] Memory usage: 1006.41MiB. Peak allocated: 1008.44MiB Peak reserved: 1032.00MiB
[default4]:07/03/2024 22:48:28 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-178]: No checkpoint path provided.
[default7]:07/03/2024 22:48:28 [INFO|DP=3|PP=1|TP=0|ip-26-0-161-178]: No checkpoint path provided.
[default0]:07/03/2024 22:48:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [Optimizer Building] Using LearningRateForSP as learning rate
[default0]:07/03/2024 22:48:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [ZeRO sharding] Size of optimizer params per rank:
[default0]:07/03/2024 22:48:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [ZeRO sharding] DP Rank 0 has 173M out of 690M (25.00%) params' optimizer states
[default0]:07/03/2024 22:48:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [ZeRO sharding] DP Rank 1 has 173M out of 690M (25.00%) params' optimizer states
[default0]:07/03/2024 22:48:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [ZeRO sharding] DP Rank 2 has 173M out of 690M (25.00%) params' optimizer states
[default0]:07/03/2024 22:48:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [ZeRO sharding] DP Rank 3 has 173M out of 690M (25.00%) params' optimizer states
[default0]:07/03/2024 22:48:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples
[default0]:07/03/2024 22:48:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Using `datasets` library
[default0]:07/03/2024 22:48:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4')
[default0]:Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/03/2024 22:48:33 [WARNING|DP=0|PP=0|TP=0|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/03/2024 22:48:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [Training Plan] There are 1 training stages
[default0]:07/03/2024 22:48:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [Stage Training Stage] start from step 1
[default0]:07/03/2024 22:48:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]:
[default0]:07/03/2024 22:48:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [Start training] datetime: 2024-07-03 22:48:34.443897 | mbs: 16 | grad_accum: 16 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0
[default0]:07/03/2024 22:48:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps
[default0]:07/03/2024 22:48:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Memory usage: 4621.51MiB. Peak allocated 4621.51MiB. Peak reserved: 4658.00MiB
[default4]:07/03/2024 22:48:34 [WARNING|DP=0|PP=1|TP=0|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/03/2024 22:48:34 [WARNING|DP=3|PP=1|TP=0|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:Repo card metadata block was not found. Setting CardData to empty.
[default7]:Repo card metadata block was not found. Setting CardData to empty.
[default1]:Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/03/2024 22:48:34 [WARNING|DP=1|PP=1|TP=0|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/03/2024 22:48:34 [WARNING|DP=2|PP=0|TP=0|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/03/2024 22:48:34 [WARNING|DP=2|PP=1|TP=0|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/03/2024 22:48:34 [WARNING|DP=3|PP=0|TP=0|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/03/2024 22:48:34 [WARNING|DP=1|PP=0|TP=0|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:Repo card metadata block was not found. Setting CardData to empty.
[default6]:Repo card metadata block was not found. Setting CardData to empty.
[default0]:[rank0]: Traceback (most recent call last):
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default0]:[rank0]: trainer.train(dataloader)
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default0]:[rank0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default0]:[rank0]: outputs = self.pipeline_engine.train_batch_iter(
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default0]:[rank0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default0]:[rank0]: output = model(**micro_batch)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank0]: return self._call_impl(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default0]:[rank0]: return forward_call(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default0]:[rank0]: sharded_logits = self.model(
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank0]: return self._call_impl(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default0]:[rank0]: return forward_call(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default0]:[rank0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default0]:[rank0]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank0]: return self._call_impl(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default0]:[rank0]: return forward_call(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default0]:[rank0]: output = self.pp_block(**new_kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank0]: return self._call_impl(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default0]:[rank0]: return forward_call(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward
[default0]:[rank0]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank0]: return self._call_impl(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default0]:[rank0]: return forward_call(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 360, in forward
[default0]:[rank0]: qkv_states = self.qkv_proj(
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank0]: return self._call_impl(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default0]:[rank0]: return forward_call(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward
[default0]:[rank0]: return column_linear(
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear
[default0]:[rank0]: return F.linear(input, weight, bias)
[default0]:[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 768.00 MiB. GPU
[default3]:[rank3]: Traceback (most recent call last):
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default3]:[rank3]: trainer.train(dataloader)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default3]:[rank3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default3]:[rank3]: outputs = self.pipeline_engine.train_batch_iter(
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default3]:[rank3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default3]:[rank3]: output = model(**micro_batch)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default3]:[rank3]: return self._call_impl(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default3]:[rank3]: return forward_call(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default3]:[rank3]: sharded_logits = self.model(
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default3]:[rank3]: return self._call_impl(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default3]:[rank3]: return forward_call(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default3]:[rank3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default3]:[rank3]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default3]:[rank3]: return self._call_impl(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default3]:[rank3]: return forward_call(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default3]:[rank3]: output = self.pp_block(**new_kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default3]:[rank3]: return self._call_impl(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default3]:[rank3]: return forward_call(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward
[default3]:[rank3]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default3]:[rank3]: return self._call_impl(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default3]:[rank3]: return forward_call(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 389, in forward
[default3]:[rank3]: .contiguous()
[default3]:[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 768.00 MiB. GPU  has a total capacity of 79.33 GiB of which 43.94 MiB is free. Including non-PyTorch memory, this process has 79.28 GiB memory in use. Of the allocated memory 67.66 GiB is allocated by PyTorch, and 400.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default1]:[rank1]: Traceback (most recent call last):
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default1]:[rank1]: trainer.train(dataloader)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default1]:[rank1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default1]:[rank1]: outputs = self.pipeline_engine.train_batch_iter(
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default1]:[rank1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default1]:[rank1]: output = model(**micro_batch)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank1]: return self._call_impl(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank1]: return forward_call(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default1]:[rank1]: sharded_logits = self.model(
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank1]: return self._call_impl(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank1]: return forward_call(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default1]:[rank1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default1]:[rank1]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank1]: return self._call_impl(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank1]: return forward_call(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default1]:[rank1]: output = self.pp_block(**new_kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank1]: return self._call_impl(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank1]: return forward_call(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward
[default1]:[rank1]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank1]: return self._call_impl(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank1]: return forward_call(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 360, in forward
[default1]:[rank1]: qkv_states = self.qkv_proj(
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank1]: return self._call_impl(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank1]: return forward_call(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward
[default1]:[rank1]: return column_linear(
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear
[default1]:[rank1]: return F.linear(input, weight, bias)
[default1]:[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 768.00 MiB. GPU  has a total capacity of 79.33 GiB of which 571.94 MiB is free. Including non-PyTorch memory, this process has 78.76 GiB memory in use. Of the allocated memory 66.91 GiB is allocated by PyTorch, and 400.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default2]:[rank2]: Traceback (most recent call last):
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default2]:[rank2]: trainer.train(dataloader)
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default2]:[rank2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default2]:[rank2]: outputs = self.pipeline_engine.train_batch_iter(
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default2]:[rank2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default2]:[rank2]: output = model(**micro_batch)
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default2]:[rank2]: return self._call_impl(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank2]: return forward_call(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default2]:[rank2]: sharded_logits = self.model(
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default2]:[rank2]: return self._call_impl(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank2]: return forward_call(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default2]:[rank2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default2]:[rank2]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default2]:[rank2]: return self._call_impl(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank2]: return forward_call(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default2]:[rank2]: output = self.pp_block(**new_kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default2]:[rank2]: return self._call_impl(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank2]: return forward_call(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward
[default2]:[rank2]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask)
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default2]:[rank2]: return self._call_impl(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank2]: return forward_call(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 360, in forward
[default2]:[rank2]: qkv_states = self.qkv_proj(
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default2]:[rank2]: return self._call_impl(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank2]: return forward_call(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward
[default2]:[rank2]: return column_linear(
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear
[default2]:[rank2]: return F.linear(input, weight, bias)
[default2]:[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 768.00 MiB. GPU  has a total capacity of 79.33 GiB of which 571.94 MiB is free. Including non-PyTorch memory, this process has 78.76 GiB memory in use. Of the allocated memory 66.91 GiB is allocated by PyTorch, and 400.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default5]:[rank5]: Traceback (most recent call last):
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default5]:[rank5]: trainer.train(dataloader)
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default5]:[rank5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default5]:[rank5]: outputs = self.pipeline_engine.train_batch_iter(
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter
[default5]:[rank5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default5]:[rank5]: output = model(**micro_batch)
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default5]:[rank5]: return self._call_impl(*args, **kwargs)
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default5]:[rank5]: return forward_call(*args, **kwargs)
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default5]:[rank5]: sharded_logits = self.model(
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default5]:[rank5]: return self._call_impl(*args, **kwargs)
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default5]:[rank5]: return forward_call(*args, **kwargs)
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default5]:[rank5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default5]:[rank5]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default5]:[rank5]: return self._call_impl(*args, **kwargs)
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default5]:[rank5]: return forward_call(*args, **kwargs)
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default5]:[rank5]: new_kwargs[name] = recv_from_pipeline_state_buffer(
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default5]:[rank5]: pipeline_state.run_communication()
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default5]:[rank5]: recv_activation_tensor = recv_activation()
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default5]:[rank5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default5]:[rank5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default5]:[rank5]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta
[default5]:[rank5]: dist.recv(
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default5]:[rank5]: return func(*args, **kwargs)
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default5]:[rank5]: pg.recv([tensor], group_src_rank, tag).wait()
[default5]:[rank5]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer
[default5]:[rank5]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
[default5]:[rank5]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ffad5c8a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default5]:[rank5]: frame #1: <unknown function> + 0x5b3a23e (0x7ffb0f7a723e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7ffb0f7a1c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7ffb0f7a1f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7ffb0f7a2fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ffb0f757371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ffb0f757371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ffb0f757371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ffb0f757371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7ffad6f64189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:[rank5]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7ffad6f6b610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:[rank5]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7ffad6f8a978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:[rank5]: frame #12: <unknown function> + 0x5adc309 (0x7ffb0f749309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #13: <unknown function> + 0x5ae6f10 (0x7ffb0f753f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #14: <unknown function> + 0x5ae6fa5 (0x7ffb0f753fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #15: <unknown function> + 0x5124446 (0x7ffb0ed91446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #16: <unknown function> + 0x1acf4b8 (0x7ffb0b73c4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #17: <unknown function> + 0x5aee004 (0x7ffb0f75b004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #18: <unknown function> + 0x5af36b5 (0x7ffb0f7606b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #19: <unknown function> + 0xd2631e (0x7ffb2234a31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default5]:[rank5]: frame #20: <unknown function> + 0x47def4 (0x7ffb21aa1ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default5]:[rank5]: frame #21: <unknown function> + 0x1445a6 (0x56058f1405a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56058f139a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #23: <unknown function> + 0x150866 (0x56058f14c866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56058f135142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #25: _PyFunction_Vectorcall + 0x6c (0x56058f140a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #26: PyObject_Call + 0xbc (0x56058f14cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x56058f1332b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #28: _PyFunction_Vectorcall + 0x6c (0x56058f140a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x56058f1318fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #30: <unknown function> + 0x150582 (0x56058f14c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x56058f1318fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #32: <unknown function> + 0x150582 (0x56058f14c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x56058f1318fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #34: <unknown function> + 0x150582 (0x56058f14c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x56058f1318fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56058f138f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #37: _PyObject_Call_Prepend + 0x69 (0x56058f14ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #38: <unknown function> + 0x211239 (0x56058f20d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #39: _PyObject_MakeTpCall + 0x26b (0x56058f139a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56058f1353e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #41: _PyFunction_Vectorcall + 0x6c (0x56058f140a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x56058f130c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #43: _PyFunction_Vectorcall + 0x6c (0x56058f140a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x56058f1318fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #45: <unknown function> + 0x150582 (0x56058f14c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #46: PyObject_Call + 0xbc (0x56058f14cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x56058f1332b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #48: <unknown function> + 0x150582 (0x56058f14c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #49: PyObject_Call + 0xbc (0x56058f14cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x56058f1332b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #51: _PyFunction_Vectorcall + 0x6c (0x56058f140a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56058f139007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #53: _PyObject_Call_Prepend + 0x69 (0x56058f14ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #54: <unknown function> + 0x211239 (0x56058f20d239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #55: PyObject_Call + 0x207 (0x56058f14d067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x56058f1332b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #57: <unknown function> + 0x150582 (0x56058f14c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x56058f1318fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #59: <unknown function> + 0x150582 (0x56058f14c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #60: PyObject_Call + 0xbc (0x56058f14cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x56058f1332b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #62: <unknown function> + 0x150582 (0x56058f14c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #63: PyObject_Call + 0xbc (0x56058f14cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: . This may indicate a possible application crash on rank 0 or a network set up issue.
[default6]:[rank6]: Traceback (most recent call last):
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default6]:[rank6]: trainer.train(dataloader)
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default6]:[rank6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default6]:[rank6]: outputs = self.pipeline_engine.train_batch_iter(
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter
[default6]:[rank6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default6]:[rank6]: output = model(**micro_batch)
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default6]:[rank6]: return self._call_impl(*args, **kwargs)
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default6]:[rank6]: return forward_call(*args, **kwargs)
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default6]:[rank6]: sharded_logits = self.model(
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default6]:[rank6]: return self._call_impl(*args, **kwargs)
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default6]:[rank6]: return forward_call(*args, **kwargs)
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default6]:[rank6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default6]:[rank6]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default6]:[rank6]: return self._call_impl(*args, **kwargs)
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default6]:[rank6]: return forward_call(*args, **kwargs)
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default6]:[rank6]: new_kwargs[name] = recv_from_pipeline_state_buffer(
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default6]:[rank6]: pipeline_state.run_communication()
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default6]:[rank6]: recv_activation_tensor = recv_activation()
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default6]:[rank6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default6]:[rank6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default6]:[rank6]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta
[default6]:[rank6]: dist.recv(
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default6]:[rank6]: return func(*args, **kwargs)
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default6]:[rank6]: pg.recv([tensor], group_src_rank, tag).wait()
[default6]:[rank6]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer
[default6]:[rank6]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
[default6]:[rank6]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f768e168897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default6]:[rank6]: frame #1: <unknown function> + 0x5b3a23e (0x7f76c7c8523e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f76c7c7fc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f76c7c7ff82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f76c7c80fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f76c7c35371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f76c7c35371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f76c7c35371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f76c7c35371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f768f442189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:[rank6]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f768f449610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:[rank6]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f768f468978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:[rank6]: frame #12: <unknown function> + 0x5adc309 (0x7f76c7c27309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #13: <unknown function> + 0x5ae6f10 (0x7f76c7c31f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f76c7c31fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #15: <unknown function> + 0x5124446 (0x7f76c726f446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #16: <unknown function> + 0x1acf4b8 (0x7f76c3c1a4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #17: <unknown function> + 0x5aee004 (0x7f76c7c39004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #18: <unknown function> + 0x5af36b5 (0x7f76c7c3e6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #19: <unknown function> + 0xd2631e (0x7f76da82831e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default6]:[rank6]: frame #20: <unknown function> + 0x47def4 (0x7f76d9f7fef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default6]:[rank6]: frame #21: <unknown function> + 0x1445a6 (0x55d0567af5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55d0567a8a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #23: <unknown function> + 0x150866 (0x55d0567bb866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55d0567a4142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55d0567afa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #26: PyObject_Call + 0xbc (0x55d0567bbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55d0567a22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55d0567afa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55d0567a08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #30: <unknown function> + 0x150582 (0x55d0567bb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55d0567a08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #32: <unknown function> + 0x150582 (0x55d0567bb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55d0567a08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #34: <unknown function> + 0x150582 (0x55d0567bb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55d0567a08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55d0567a7f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55d0567b9c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #38: <unknown function> + 0x211239 (0x55d05687c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55d0567a8a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55d0567a43e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55d0567afa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55d05679fc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55d0567afa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55d0567a08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #45: <unknown function> + 0x150582 (0x55d0567bb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #46: PyObject_Call + 0xbc (0x55d0567bbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55d0567a22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #48: <unknown function> + 0x150582 (0x55d0567bb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #49: PyObject_Call + 0xbc (0x55d0567bbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55d0567a22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55d0567afa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55d0567a8007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55d0567b9c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #54: <unknown function> + 0x211239 (0x55d05687c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #55: PyObject_Call + 0x207 (0x55d0567bc067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55d0567a22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #57: <unknown function> + 0x150582 (0x55d0567bb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55d0567a08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #59: <unknown function> + 0x150582 (0x55d0567bb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #60: PyObject_Call + 0xbc (0x55d0567bbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55d0567a22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #62: <unknown function> + 0x150582 (0x55d0567bb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #63: PyObject_Call + 0xbc (0x55d0567bbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: . This may indicate a possible application crash on rank 0 or a network set up issue.
[default4]:[rank4]: Traceback (most recent call last):
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default4]:[rank4]: trainer.train(dataloader)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter(
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter
[default4]:[rank4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default4]:[rank4]: output = model(**micro_batch)
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default4]:[rank4]: return self._call_impl(*args, **kwargs)
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default4]:[rank4]: return forward_call(*args, **kwargs)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default4]:[rank4]: sharded_logits = self.model(
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default4]:[rank4]: return self._call_impl(*args, **kwargs)
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default4]:[rank4]: return forward_call(*args, **kwargs)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default4]:[rank4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default4]:[rank4]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default4]:[rank4]: return self._call_impl(*args, **kwargs)
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default4]:[rank4]: return forward_call(*args, **kwargs)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default4]:[rank4]: new_kwargs[name] = recv_from_pipeline_state_buffer(
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default4]:[rank4]: pipeline_state.run_communication()
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default4]:[rank4]: recv_activation_tensor = recv_activation()
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default4]:[rank4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default4]:[rank4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default4]:[rank4]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta
[default4]:[rank4]: dist.recv(
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default4]:[rank4]: return func(*args, **kwargs)
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default4]:[rank4]: pg.recv([tensor], group_src_rank, tag).wait()
[default4]:[rank4]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer
[default4]:[rank4]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
[default4]:[rank4]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f642bf30897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default4]:[rank4]: frame #1: <unknown function> + 0x5b3a23e (0x7f6465a4d23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f6465a47c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f6465a47f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f6465a48fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f64659fd371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f64659fd371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f64659fd371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f64659fd371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f642d20a189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:[rank4]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f642d211610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:[rank4]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f642d230978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:[rank4]: frame #12: <unknown function> + 0x5adc309 (0x7f64659ef309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #13: <unknown function> + 0x5ae6f10 (0x7f64659f9f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f64659f9fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #15: <unknown function> + 0x5124446 (0x7f6465037446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #16: <unknown function> + 0x1acf4b8 (0x7f64619e24b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #17: <unknown function> + 0x5aee004 (0x7f6465a01004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #18: <unknown function> + 0x5af36b5 (0x7f6465a066b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #19: <unknown function> + 0xd2631e (0x7f64785f031e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default4]:[rank4]: frame #20: <unknown function> + 0x47def4 (0x7f6477d47ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default4]:[rank4]: frame #21: <unknown function> + 0x1445a6 (0x55a12e86c5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55a12e865a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #23: <unknown function> + 0x150866 (0x55a12e878866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55a12e861142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55a12e86ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #26: PyObject_Call + 0xbc (0x55a12e878f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55a12e85f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55a12e86ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55a12e85d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #30: <unknown function> + 0x150582 (0x55a12e878582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55a12e85d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #32: <unknown function> + 0x150582 (0x55a12e878582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55a12e85d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #34: <unknown function> + 0x150582 (0x55a12e878582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55a12e85d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55a12e864f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: Traceback (most recent call last):
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default7]:[rank7]: trainer.train(dataloader)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default7]:[rank7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default7]:[rank7]: outputs = self.pipeline_engine.train_batch_iter(
[default4]:[rank4]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55a12e876c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter
[default7]:[rank7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default4]:[rank4]: frame #38: <unknown function> + 0x211239 (0x55a12e939239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55a12e865a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55a12e8613e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: output = model(**micro_batch)
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default4]:[rank4]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55a12e86ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55a12e85cc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: return self._call_impl(*args, **kwargs)
[default4]:[rank4]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55a12e86ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55a12e85d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #45: <unknown function> + 0x150582 (0x55a12e878582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default7]:[rank7]: return forward_call(*args, **kwargs)
[default4]:[rank4]: frame #46: PyObject_Call + 0xbc (0x55a12e878f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55a12e85f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #48: <unknown function> + 0x150582 (0x55a12e878582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default7]:[rank7]: sharded_logits = self.model(
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default4]:[rank4]: frame #49: PyObject_Call + 0xbc (0x55a12e878f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: return self._call_impl(*args, **kwargs)
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default7]:[rank7]: return forward_call(*args, **kwargs)
[default4]:[rank4]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55a12e85f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default7]:[rank7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default7]:[rank7]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default4]:[rank4]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55a12e86ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: return self._call_impl(*args, **kwargs)
[default4]:[rank4]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55a12e865007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default7]:[rank7]: return forward_call(*args, **kwargs)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default7]:[rank7]: new_kwargs[name] = recv_from_pipeline_state_buffer(
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default4]:[rank4]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55a12e876c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: pipeline_state.run_communication()
[default4]:[rank4]: frame #54: <unknown function> + 0x211239 (0x55a12e939239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #55: PyObject_Call + 0x207 (0x55a12e879067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55a12e85f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #57: <unknown function> + 0x150582 (0x55a12e878582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55a12e85d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #59: <unknown function> + 0x150582 (0x55a12e878582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default7]:[rank7]: recv_activation_tensor = recv_activation()
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default4]:[rank4]: frame #60: PyObject_Call + 0xbc (0x55a12e878f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default7]:[rank7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default7]:[rank7]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta
[default7]:[rank7]: dist.recv(
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default4]:[rank4]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55a12e85f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: return func(*args, **kwargs)
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default4]:[rank4]: frame #62: <unknown function> + 0x150582 (0x55a12e878582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #63: PyObject_Call + 0xbc (0x55a12e878f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: . This may indicate a possible application crash on rank 0 or a network set up issue.
[default7]:[rank7]: pg.recv([tensor], group_src_rank, tag).wait()
[default7]:[rank7]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer
[default7]:[rank7]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
[default7]:[rank7]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3177253897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default7]:[rank7]: frame #1: <unknown function> + 0x5b3a23e (0x7f31b0d7023e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f31b0d6ac87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f31b0d6af82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f31b0d6bfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f31b0d20371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f31b0d20371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f31b0d20371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f31b0d20371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f317852d189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:[rank7]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f3178534610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:[rank7]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f3178553978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:[rank7]: frame #12: <unknown function> + 0x5adc309 (0x7f31b0d12309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #13: <unknown function> + 0x5ae6f10 (0x7f31b0d1cf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f31b0d1cfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #15: <unknown function> + 0x5124446 (0x7f31b035a446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #16: <unknown function> + 0x1acf4b8 (0x7f31acd054b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #17: <unknown function> + 0x5aee004 (0x7f31b0d24004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #18: <unknown function> + 0x5af36b5 (0x7f31b0d296b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #19: <unknown function> + 0xd2631e (0x7f31c391331e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default7]:[rank7]: frame #20: <unknown function> + 0x47def4 (0x7f31c306aef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default7]:[rank7]: frame #21: <unknown function> + 0x1445a6 (0x55f97cbba5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55f97cbb3a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #23: <unknown function> + 0x150866 (0x55f97cbc6866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55f97cbaf142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55f97cbbaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #26: PyObject_Call + 0xbc (0x55f97cbc6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55f97cbad2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55f97cbbaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55f97cbab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #30: <unknown function> + 0x150582 (0x55f97cbc6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55f97cbab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #32: <unknown function> + 0x150582 (0x55f97cbc6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55f97cbab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #34: <unknown function> + 0x150582 (0x55f97cbc6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55f97cbab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55f97cbb2f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55f97cbc4c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #38: <unknown function> + 0x211239 (0x55f97cc87239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55f97cbb3a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55f97cbaf3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55f97cbbaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55f97cbaac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55f97cbbaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55f97cbab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #45: <unknown function> + 0x150582 (0x55f97cbc6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #46: PyObject_Call + 0xbc (0x55f97cbc6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55f97cbad2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #48: <unknown function> + 0x150582 (0x55f97cbc6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #49: PyObject_Call + 0xbc (0x55f97cbc6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55f97cbad2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55f97cbbaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55f97cbb3007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55f97cbc4c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #54: <unknown function> + 0x211239 (0x55f97cc87239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #55: PyObject_Call + 0x207 (0x55f97cbc7067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55f97cbad2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #57: <unknown function> + 0x150582 (0x55f97cbc6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55f97cbab8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #59: <unknown function> + 0x150582 (0x55f97cbc6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #60: PyObject_Call + 0xbc (0x55f97cbc6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55f97cbad2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #62: <unknown function> + 0x150582 (0x55f97cbc6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #63: PyObject_Call + 0xbc (0x55f97cbc6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: . This may indicate a possible application crash on rank 0 or a network set up issue.
W0703 22:48:41.195000 140318128015168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1031789 closing signal SIGTERM
W0703 22:48:41.195000 140318128015168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1031790 closing signal SIGTERM
W0703 22:48:41.196000 140318128015168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1031791 closing signal SIGTERM
W0703 22:48:41.196000 140318128015168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1031792 closing signal SIGTERM
E0703 22:48:42.010000 140318128015168 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1031785) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-03_22:48:41
  host      : ip-26-0-161-178.ec2.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1031786)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-07-03_22:48:41
  host      : ip-26-0-161-178.ec2.internal
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1031787)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-07-03_22:48:41
  host      : ip-26-0-161-178.ec2.internal
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1031788)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-03_22:48:41
  host      : ip-26-0-161-178.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1031785)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: ip-26-0-161-178: task 0: Exited with exit code 1
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.