Upload llama-1B/8_GPUS/dp-2_tp-2_pp-2_mbz-64
========================
START TIME: Wed Jul 3 21:27:20 UTC 2024
python3 version = Python 3.10.14
========================
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token
Login successful
Already on 'bench_cluster'
M examples/config_tiny_llama.py
M examples/config_tiny_llama.yaml
M examples/train_tiny_llama.sh
M src/nanotron/models/llama.py
M src/nanotron/trainer.py
Your branch is up to date with 'origin/bench_cluster'.
Job status: RUNNING
W0703 21:27:25.777000 140016841295680 torch/distributed/run.py:757]
W0703 21:27:25.777000 140016841295680 torch/distributed/run.py:757] *****************************************
W0703 21:27:25.777000 140016841295680 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 21:27:25.777000 140016841295680 torch/distributed/run.py:757] *****************************************
[default0]:07/03/2024 21:27:47 [WARNING|DP=0|PP=0|TP=0|ip-26-0-174-36]: [Vocab Size Padding] Padded vocab (size: 50257) with 1 dummy tokens (new size: 50258)
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: Config:
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: Config(general=GeneralArgs(project='bench_cluster',
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: run='%date_%jobid',
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: seed=42,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: step=None,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: consumed_train_samples=None,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: benchmark_csv_path=None,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: ignore_sanity_checks=True),
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: parallelism=ParallelismArgs(dp=2,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: pp=2,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: tp=2,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: pp_engine=<nanotron.parallel.pipeline_parallel.engine.OneForwardOneBackwardPipelineEngine object at 0x7f105eff0940>,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: tp_mode=<TensorParallelLinearMode.REDUCE_SCATTER: 2>,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: tp_linear_async_communication=False,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: expert_parallel_size=1),
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: eos_token_id=2,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: hidden_act='silu',
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: hidden_size=2048,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: initializer_range=0.02,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: intermediate_size=4096,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: is_llama_config=True,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: max_position_embeddings=4096,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: num_attention_heads=32,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: num_hidden_layers=24,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: num_key_value_heads=32,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: pad_token_id=None,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: pretraining_tp=1,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: rms_norm_eps=1e-05,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: rope_scaling=None,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: rope_theta=10000.0,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: tie_word_embeddings=True,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: use_cache=True,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: vocab_size=50258),
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: init_method=RandomInit(std=0.025),
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: dtype=torch.bfloat16,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: make_vocab_size_divisible_by=1,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: ddp_bucket_cap_mb=25),
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2',
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: tokenizer_revision=None,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: tokenizer_max_length=None),
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'),
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: checkpoint_interval=100000,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: save_initial_state=False,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: resume_checkpoint_path=None,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: checkpoints_path_is_shared_file_system=False),
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: logging=LoggingArgs(log_level='info',
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: log_level_replica='info',
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: iteration_step_info_interval=1),
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: tokens=TokensArgs(sequence_length=4096,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: train_steps=20,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: micro_batch_size=64,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: batch_accumulation_per_replica=8,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: val_check_interval=-1,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: limit_val_batches=0,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: limit_test_batches=0),
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: adam_beta1=0.9,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: adam_beta2=0.95,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: torch_adam_is_fused=True,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: name='adamW'),
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: zero_stage=1,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: weight_decay=0.01,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: clip_grad=1.0,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: accumulate_grad_in_fp32=True,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: lr_warmup_steps=1,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: lr_warmup_style='linear',
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: lr_decay_style='linear',
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: lr_decay_steps=19,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: lr_decay_starting_step=None,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: min_decay_lr=1e-05)),
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: data_stages=[DatasetStageArgs(name='Training Stage',
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: start_training_step=1,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories',
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: hf_dataset_splits='train',
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: hf_dataset_config_name=None,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: dataset_processing_num_proc_per_process=64,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: dataset_overwrite_cache=False,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: text_column_name='text'),
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: seed=42,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: num_loading_workers=0))],
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/8_GPUS/dp-2_tp-2_pp-2_mbz-64')),
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: lighteval=None)
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: Model Config:
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: LlamaConfig(bos_token_id=1,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: eos_token_id=2,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: hidden_act='silu',
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: hidden_size=2048,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: initializer_range=0.02,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: intermediate_size=4096,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: is_llama_config=True,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: max_position_embeddings=4096,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: num_attention_heads=32,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: num_hidden_layers=24,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: num_key_value_heads=32,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: pad_token_id=None,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: pretraining_tp=1,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: rms_norm_eps=1e-05,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: rope_scaling=None,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: rope_theta=10000.0,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: tie_word_embeddings=True,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: use_cache=True,
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: vocab_size=50258)
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: Building model..
[default0]:07/03/2024 21:27:47 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: Setting PP block ranks...
[default5]:07/03/2024 21:27:59 [INFO|DP=0|PP=1|TP=1|ip-26-0-174-36]: Local number of parameters: 261M (498.24MiB)
[default5]:07/03/2024 21:27:59 [INFO|DP=0|PP=1|TP=1|ip-26-0-174-36]: [After model building] Memory usage: 508.26MiB. Peak allocated: 510.29MiB Peak reserved: 526.00MiB
[default5]:07/03/2024 21:27:59 [INFO|DP=0|PP=1|TP=1|ip-26-0-174-36]: No checkpoint path provided.
[default1]:07/03/2024 21:27:59 [INFO|DP=0|PP=0|TP=1|ip-26-0-174-36]: Local number of parameters: 345M (658.27MiB)
[default1]:07/03/2024 21:27:59 [INFO|DP=0|PP=0|TP=1|ip-26-0-174-36]: [After model building] Memory usage: 672.29MiB. Peak allocated: 674.32MiB Peak reserved: 690.00MiB
[default1]:07/03/2024 21:27:59 [INFO|DP=0|PP=0|TP=1|ip-26-0-174-36]: No checkpoint path provided.
[default2]:07/03/2024 21:27:59 [INFO|DP=1|PP=0|TP=0|ip-26-0-174-36]: No checkpoint path provided.
[default6]:07/03/2024 21:27:59 [INFO|DP=1|PP=1|TP=0|ip-26-0-174-36]: No checkpoint path provided.
[default4]:07/03/2024 21:27:59 [INFO|DP=0|PP=1|TP=0|ip-26-0-174-36]: Local number of parameters: 261M (498.24MiB)
[default4]:07/03/2024 21:27:59 [INFO|DP=0|PP=1|TP=0|ip-26-0-174-36]: [After model building] Memory usage: 508.26MiB. Peak allocated: 510.29MiB Peak reserved: 526.00MiB
[default4]:07/03/2024 21:27:59 [INFO|DP=0|PP=1|TP=0|ip-26-0-174-36]: No checkpoint path provided.
[default0]:07/03/2024 21:27:59 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: Total number of parameters: 1.21G (2313.02MiB)
[default0]:07/03/2024 21:27:59 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: Local number of parameters: 345M (658.27MiB)
[default0]:07/03/2024 21:27:59 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: [After model building] Memory usage: 672.29MiB. Peak allocated: 674.32MiB Peak reserved: 690.00MiB
[default0]:07/03/2024 21:27:59 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: No checkpoint path provided.
[default0]:07/03/2024 21:27:59 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: Parametrizing model parameters using StandardParametrizator
[default3]:07/03/2024 21:27:59 [INFO|DP=1|PP=0|TP=1|ip-26-0-174-36]: No checkpoint path provided.
[default7]:07/03/2024 21:27:59 [INFO|DP=1|PP=1|TP=1|ip-26-0-174-36]: No checkpoint path provided.
[default0]:07/03/2024 21:28:01 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: [Optimizer Building] Using LearningRateForSP as learning rate
[default0]:07/03/2024 21:28:01 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: [ZeRO sharding] Size of optimizer params per rank:
[default0]:07/03/2024 21:28:01 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: [ZeRO sharding] DP Rank 0 has 173M out of 345M (50.00%) params' optimizer states
[default0]:07/03/2024 21:28:01 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: [ZeRO sharding] DP Rank 1 has 173M out of 345M (50.00%) params' optimizer states
[default0]:07/03/2024 21:28:03 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples
[default0]:07/03/2024 21:28:03 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: Using `datasets` library
[default0]:07/03/2024 21:28:03 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4')
[default0]:07/03/2024 21:28:03 [WARNING|DP=0|PP=0|TP=0|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/03/2024 21:28:05 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: [Training Plan] There are 1 training stages
[default0]:07/03/2024 21:28:05 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: [Stage Training Stage] start from step 1
[default0]:07/03/2024 21:28:05 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]:
[default0]:07/03/2024 21:28:05 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: [Start training] datetime: 2024-07-03 21:28:05.641987 | mbs: 64 | grad_accum: 8 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0
[default0]:07/03/2024 21:28:05 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps
[default0]:07/03/2024 21:28:05 [INFO|DP=0|PP=0|TP=0|ip-26-0-174-36]: Memory usage: 2647.09MiB. Peak allocated 2647.09MiB. Peak reserved: 2668.00MiB
[default7]:Repo card metadata block was not found. Setting CardData to empty.
[default3]:Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/03/2024 21:28:05 [WARNING|DP=0|PP=1|TP=1|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/03/2024 21:28:05 [WARNING|DP=0|PP=0|TP=1|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/03/2024 21:28:05 [WARNING|DP=1|PP=0|TP=0|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/03/2024 21:28:05 [WARNING|DP=1|PP=1|TP=0|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/03/2024 21:28:05 [WARNING|DP=0|PP=1|TP=0|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/03/2024 21:28:05 [WARNING|DP=1|PP=0|TP=1|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:Repo card metadata block was not found. Setting CardData to empty.
[default1]:Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/03/2024 21:28:05 [WARNING|DP=1|PP=1|TP=1|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:Repo card metadata block was not found. Setting CardData to empty.
[default6]:Repo card metadata block was not found. Setting CardData to empty.
[default3]:[rank3]: Traceback (most recent call last):
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default3]:[rank3]: trainer.train(dataloader)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default3]:[rank3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default3]:[rank3]: outputs = self.pipeline_engine.train_batch_iter(
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default3]:[rank3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default3]:[rank3]: output = model(**micro_batch)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default3]:[rank3]: return self._call_impl(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default3]:[rank3]: return forward_call(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default3]:[rank3]: sharded_logits = self.model(
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default3]:[rank3]: return self._call_impl(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default3]:[rank3]: return forward_call(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default3]:[rank3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default3]:[rank3]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default3]:[rank3]: return self._call_impl(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default3]:[rank3]: return forward_call(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default3]:[rank3]: output = self.pp_block(**new_kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default3]:[rank3]: return self._call_impl(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default3]:[rank3]: return forward_call(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward
[default3]:[rank3]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default3]:[rank3]: return self._call_impl(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default3]:[rank3]: return forward_call(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 360, in forward
[default3]:[rank3]: qkv_states = self.qkv_proj(
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default3]:[rank3]: return self._call_impl(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default3]:[rank3]: return forward_call(*args, **kwargs)
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward
[default3]:[rank3]: return column_linear(
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear
[default3]:[rank3]: return F.linear(input, weight, bias)
[default3]:[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.50 GiB. GPU  has a total capacity of 79.33 GiB of which 75.94 MiB is free. Including non-PyTorch memory, this process has 79.24 GiB memory in use. Of the allocated memory 67.72 GiB is allocated by PyTorch, and 1.42 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default1]:[rank1]: Traceback (most recent call last):
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default1]:[rank1]: trainer.train(dataloader)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default1]:[rank1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default1]:[rank1]: outputs = self.pipeline_engine.train_batch_iter(
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default1]:[rank1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default1]:[rank1]: output = model(**micro_batch)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank1]: return self._call_impl(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank1]: return forward_call(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default1]:[rank1]: sharded_logits = self.model(
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank1]: return self._call_impl(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank1]: return forward_call(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default1]:[rank1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default1]:[rank1]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank1]: return self._call_impl(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank1]: return forward_call(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default1]:[rank1]: output = self.pp_block(**new_kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank1]: return self._call_impl(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank1]: return forward_call(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward
[default1]:[rank1]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank1]: return self._call_impl(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank1]: return forward_call(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 360, in forward
[default1]:[rank1]: qkv_states = self.qkv_proj(
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default1]:[rank1]: return self._call_impl(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default1]:[rank1]: return forward_call(*args, **kwargs)
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward
[default1]:[rank1]: return column_linear(
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear
[default1]:[rank1]: return F.linear(input, weight, bias)
[default1]:[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.50 GiB. GPU  has a total capacity of 79.33 GiB of which 75.94 MiB is free. Including non-PyTorch memory, this process has 79.24 GiB memory in use. Of the allocated memory 67.72 GiB is allocated by PyTorch, and 1.42 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default2]:[rank2]: Traceback (most recent call last):
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default0]:[rank0]: Traceback (most recent call last):
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default2]:[rank2]: trainer.train(dataloader)
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default2]:[rank2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default2]:[rank2]: outputs = self.pipeline_engine.train_batch_iter(
[default0]:[rank0]: trainer.train(dataloader)
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default0]:[rank0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default0]:[rank0]: outputs = self.pipeline_engine.train_batch_iter(
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter
[default2]:[rank2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default0]:[rank0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default0]:[rank0]: output = model(**micro_batch)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank0]: return self._call_impl(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default0]:[rank0]: return forward_call(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default2]:[rank2]: output = model(**micro_batch)
[default0]:[rank0]: sharded_logits = self.model(
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank0]: return self._call_impl(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default0]:[rank0]: return forward_call(*args, **kwargs)
[default2]:[rank2]: return self._call_impl(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank2]: return forward_call(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default2]:[rank2]: sharded_logits = self.model(
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default0]:[rank0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default2]:[rank2]: return self._call_impl(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank2]: return forward_call(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default0]:[rank0]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default2]:[rank2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default2]:[rank2]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default2]:[rank2]: return self._call_impl(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank0]: return self._call_impl(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank2]: return forward_call(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default2]:[rank2]: output = self.pp_block(**new_kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank0]: return forward_call(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default0]:[rank0]: output = self.pp_block(**new_kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank0]: return self._call_impl(*args, **kwargs)
[default2]:[rank2]: return self._call_impl(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank2]: return forward_call(*args, **kwargs)
[default0]:[rank0]: return forward_call(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward
[default2]:[rank2]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask)
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank0]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank0]: return self._call_impl(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank2]: return self._call_impl(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default0]:[rank0]: return forward_call(*args, **kwargs)
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 360, in forward
[default2]:[rank2]: return forward_call(*args, **kwargs)
[default0]:[rank0]: qkv_states = self.qkv_proj(
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 360, in forward
[default2]:[rank2]: qkv_states = self.qkv_proj(
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank0]: return self._call_impl(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default2]:[rank2]: return self._call_impl(*args, **kwargs)
[default0]:[rank0]: return forward_call(*args, **kwargs)
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward
[default2]:[rank2]: return forward_call(*args, **kwargs)
[default0]:[rank0]: return column_linear(
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward
[default0]:[rank0]: return F.linear(input, weight, bias)
[default0]:[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.50 GiB. GPU
[default2]:[rank2]: return column_linear(
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear
[default2]:[rank2]: return F.linear(input, weight, bias)
[default2]:[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.50 GiB. GPU  has a total capacity of 79.33 GiB of which 75.94 MiB is free. Including non-PyTorch memory, this process has 79.24 GiB memory in use. Of the allocated memory 67.72 GiB is allocated by PyTorch, and 1.42 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default7]:[rank7]: Traceback (most recent call last):
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default7]:[rank7]: trainer.train(dataloader)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default7]:[rank7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default7]:[rank7]: outputs = self.pipeline_engine.train_batch_iter(
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter
[default7]:[rank7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default7]:[rank7]: output = model(**micro_batch)
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default7]:[rank7]: return self._call_impl(*args, **kwargs)
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default7]:[rank7]: return forward_call(*args, **kwargs)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default7]:[rank7]: sharded_logits = self.model(
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default7]:[rank7]: return self._call_impl(*args, **kwargs)
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default7]:[rank7]: return forward_call(*args, **kwargs)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default7]:[rank7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default7]:[rank7]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default7]:[rank7]: return self._call_impl(*args, **kwargs)
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default7]:[rank7]: return forward_call(*args, **kwargs)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default7]:[rank7]: new_kwargs[name] = recv_from_pipeline_state_buffer(
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default7]:[rank7]: pipeline_state.run_communication()
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default7]:[rank7]: recv_activation_tensor = recv_activation()
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default7]:[rank7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default7]:[rank7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default7]:[rank7]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta
[default7]:[rank7]: dist.recv(
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default7]:[rank7]: return func(*args, **kwargs)
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default7]:[rank7]: pg.recv([tensor], group_src_rank, tag).wait()
[default7]:[rank7]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer
[default7]:[rank7]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
[default7]:[rank7]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7eff38f45897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default7]:[rank7]: frame #1: <unknown function> + 0x5b3a23e (0x7eff72a6223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7eff72a5cc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7eff72a5cf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7eff72a5dfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7eff72a12371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7eff72a12371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7eff72a12371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7eff72a12371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7eff3a21f189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:[rank7]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7eff3a226610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:[rank7]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7eff3a245978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:[rank7]: frame #12: <unknown function> + 0x5adc309 (0x7eff72a04309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #13: <unknown function> + 0x5ae6f10 (0x7eff72a0ef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #14: <unknown function> + 0x5ae6fa5 (0x7eff72a0efa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #15: <unknown function> + 0x5124446 (0x7eff7204c446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #16: <unknown function> + 0x1acf4b8 (0x7eff6e9f74b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #17: <unknown function> + 0x5aee004 (0x7eff72a16004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #18: <unknown function> + 0x5af36b5 (0x7eff72a1b6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default7]:[rank7]: frame #19: <unknown function> + 0xd2631e (0x7eff8560531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default7]:[rank7]: frame #20: <unknown function> + 0x47def4 (0x7eff84d5cef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default7]:[rank7]: frame #21: <unknown function> + 0x1445a6 (0x55720a94b5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55720a944a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #23: <unknown function> + 0x150866 (0x55720a957866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55720a940142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55720a94ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #26: PyObject_Call + 0xbc (0x55720a957f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55720a93e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55720a94ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55720a93c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #30: <unknown function> + 0x150582 (0x55720a957582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55720a93c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #32: <unknown function> + 0x150582 (0x55720a957582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55720a93c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #34: <unknown function> + 0x150582 (0x55720a957582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55720a93c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55720a943f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55720a955c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #38: <unknown function> + 0x211239 (0x55720aa18239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55720a944a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55720a9403e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55720a94ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55720a93bc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55720a94ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55720a93c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #45: <unknown function> + 0x150582 (0x55720a957582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #46: PyObject_Call + 0xbc (0x55720a957f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55720a93e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #48: <unknown function> + 0x150582 (0x55720a957582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #49: PyObject_Call + 0xbc (0x55720a957f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55720a93e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55720a94ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55720a944007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55720a955c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #54: <unknown function> + 0x211239 (0x55720aa18239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #55: PyObject_Call + 0x207 (0x55720a958067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55720a93e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #57: <unknown function> + 0x150582 (0x55720a957582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55720a93c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #59: <unknown function> + 0x150582 (0x55720a957582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #60: PyObject_Call + 0xbc (0x55720a957f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55720a93e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #62: <unknown function> + 0x150582 (0x55720a957582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: frame #63: PyObject_Call + 0xbc (0x55720a957f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default7]:[rank7]: . This may indicate a possible application crash on rank 0 or a network set up issue.
[default4]:[rank4]: Traceback (most recent call last):
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default4]:[rank4]: trainer.train(dataloader)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter(
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter
[default4]:[rank4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default4]:[rank4]: output = model(**micro_batch)
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default4]:[rank4]: return self._call_impl(*args, **kwargs)
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default4]:[rank4]: return forward_call(*args, **kwargs)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default4]:[rank4]: sharded_logits = self.model(
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default4]:[rank4]: return self._call_impl(*args, **kwargs)
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default4]:[rank4]: return forward_call(*args, **kwargs)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default4]:[rank4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default4]:[rank4]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default4]:[rank4]: return self._call_impl(*args, **kwargs)
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default4]:[rank4]: return forward_call(*args, **kwargs)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default4]:[rank4]: new_kwargs[name] = recv_from_pipeline_state_buffer(
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default4]:[rank4]: pipeline_state.run_communication()
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default4]:[rank4]: recv_activation_tensor = recv_activation()
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default4]:[rank4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default4]:[rank4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default4]:[rank4]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta
[default4]:[rank4]: dist.recv(
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default4]:[rank4]: return func(*args, **kwargs)
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default4]:[rank4]: pg.recv([tensor], group_src_rank, tag).wait()
[default4]:[rank4]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer
[default4]:[rank4]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
[default4]:[rank4]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe8e06a4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default4]:[rank4]: frame #1: <unknown function> + 0x5b3a23e (0x7fe91a1c123e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7fe91a1bbc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fe91a1bbf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fe91a1bcfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe91a171371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe91a171371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe91a171371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe91a171371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fe8e197e189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:[rank4]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fe8e1985610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:[rank4]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7fe8e19a4978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:[rank4]: frame #12: <unknown function> + 0x5adc309 (0x7fe91a163309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #13: <unknown function> + 0x5ae6f10 (0x7fe91a16df10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #14: <unknown function> + 0x5ae6fa5 (0x7fe91a16dfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #15: <unknown function> + 0x5124446 (0x7fe9197ab446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #16: <unknown function> + 0x1acf4b8 (0x7fe9161564b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #17: <unknown function> + 0x5aee004 (0x7fe91a175004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #18: <unknown function> + 0x5af36b5 (0x7fe91a17a6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default4]:[rank4]: frame #19: <unknown function> + 0xd2631e (0x7fe92cd6431e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default4]:[rank4]: frame #20: <unknown function> + 0x47def4 (0x7fe92c4bbef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default4]:[rank4]: frame #21: <unknown function> + 0x1445a6 (0x562fec9095a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #22: _PyObject_MakeTpCall + 0x26b (0x562fec902a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #23: <unknown function> + 0x150866 (0x562fec915866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x562fec8fe142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #25: _PyFunction_Vectorcall + 0x6c (0x562fec909a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #26: PyObject_Call + 0xbc (0x562fec915f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x562fec8fc2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #28: _PyFunction_Vectorcall + 0x6c (0x562fec909a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x562fec8fa8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #30: <unknown function> + 0x150582 (0x562fec915582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x562fec8fa8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #32: <unknown function> + 0x150582 (0x562fec915582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x562fec8fa8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #34: <unknown function> + 0x150582 (0x562fec915582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x562fec8fa8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x562fec901f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #37: _PyObject_Call_Prepend + 0x69 (0x562fec913c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #38: <unknown function> + 0x211239 (0x562fec9d6239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #39: _PyObject_MakeTpCall + 0x26b (0x562fec902a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x562fec8fe3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #41: _PyFunction_Vectorcall + 0x6c (0x562fec909a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x562fec8f9c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #43: _PyFunction_Vectorcall + 0x6c (0x562fec909a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x562fec8fa8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #45: <unknown function> + 0x150582 (0x562fec915582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #46: PyObject_Call + 0xbc (0x562fec915f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x562fec8fc2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #48: <unknown function> + 0x150582 (0x562fec915582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #49: PyObject_Call + 0xbc (0x562fec915f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x562fec8fc2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #51: _PyFunction_Vectorcall + 0x6c (0x562fec909a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x562fec902007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #53: _PyObject_Call_Prepend + 0x69 (0x562fec913c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #54: <unknown function> + 0x211239 (0x562fec9d6239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #55: PyObject_Call + 0x207 (0x562fec916067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x562fec8fc2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #57: <unknown function> + 0x150582 (0x562fec915582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x562fec8fa8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #59: <unknown function> + 0x150582 (0x562fec915582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #60: PyObject_Call + 0xbc (0x562fec915f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x562fec8fc2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #62: <unknown function> + 0x150582 (0x562fec915582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: frame #63: PyObject_Call + 0xbc (0x562fec915f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default4]:[rank4]: . This may indicate a possible application crash on rank 0 or a network set up issue.
[default5]:[rank5]: Traceback (most recent call last):
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default5]:[rank5]: trainer.train(dataloader)
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default5]:[rank5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default5]:[rank5]: outputs = self.pipeline_engine.train_batch_iter(
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter
[default5]:[rank5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default5]:[rank5]: output = model(**micro_batch)
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default5]:[rank5]: return self._call_impl(*args, **kwargs)
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default5]:[rank5]: return forward_call(*args, **kwargs)
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default5]:[rank5]: sharded_logits = self.model(
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default5]:[rank5]: return self._call_impl(*args, **kwargs)
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default5]:[rank5]: return forward_call(*args, **kwargs)
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default5]:[rank5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default5]:[rank5]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default5]:[rank5]: return self._call_impl(*args, **kwargs)
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default5]:[rank5]: return forward_call(*args, **kwargs)
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default5]:[rank5]: new_kwargs[name] = recv_from_pipeline_state_buffer(
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default5]:[rank5]: pipeline_state.run_communication()
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default5]:[rank5]: recv_activation_tensor = recv_activation()
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default5]:[rank5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default5]:[rank5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default5]:[rank5]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta
[default5]:[rank5]: dist.recv(
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default5]:[rank5]: return func(*args, **kwargs)
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default5]:[rank5]: pg.recv([tensor], group_src_rank, tag).wait()
[default5]:[rank5]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer
[default5]:[rank5]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
[default5]:[rank5]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f05c0b33897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default5]:[rank5]: frame #1: <unknown function> + 0x5b3a23e (0x7f05fa65023e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f05fa64ac87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f05fa64af82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f05fa64bfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f05fa600371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f05fa600371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f05fa600371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f05fa600371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f05c1e0d189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:[rank5]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f05c1e14610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:[rank5]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f05c1e33978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:[rank5]: frame #12: <unknown function> + 0x5adc309 (0x7f05fa5f2309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #13: <unknown function> + 0x5ae6f10 (0x7f05fa5fcf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f05fa5fcfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #15: <unknown function> + 0x5124446 (0x7f05f9c3a446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #16: <unknown function> + 0x1acf4b8 (0x7f05f65e54b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #17: <unknown function> + 0x5aee004 (0x7f05fa604004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #18: <unknown function> + 0x5af36b5 (0x7f05fa6096b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default5]:[rank5]: frame #19: <unknown function> + 0xd2631e (0x7f060d1f331e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default5]:[rank5]: frame #20: <unknown function> + 0x47def4 (0x7f060c94aef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default5]:[rank5]: frame #21: <unknown function> + 0x1445a6 (0x559be08115a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #22: _PyObject_MakeTpCall + 0x26b (0x559be080aa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #23: <unknown function> + 0x150866 (0x559be081d866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x559be0806142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #25: _PyFunction_Vectorcall + 0x6c (0x559be0811a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #26: PyObject_Call + 0xbc (0x559be081df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x559be08042b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #28: _PyFunction_Vectorcall + 0x6c (0x559be0811a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x559be08028fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #30: <unknown function> + 0x150582 (0x559be081d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x559be08028fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #32: <unknown function> + 0x150582 (0x559be081d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x559be08028fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #34: <unknown function> + 0x150582 (0x559be081d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x559be08028fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x559be0809f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #37: _PyObject_Call_Prepend + 0x69 (0x559be081bc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #38: <unknown function> + 0x211239 (0x559be08de239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #39: _PyObject_MakeTpCall + 0x26b (0x559be080aa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x559be08063e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #41: _PyFunction_Vectorcall + 0x6c (0x559be0811a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x559be0801c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #43: _PyFunction_Vectorcall + 0x6c (0x559be0811a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x559be08028fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #45: <unknown function> + 0x150582 (0x559be081d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #46: PyObject_Call + 0xbc (0x559be081df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x559be08042b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #48: <unknown function> + 0x150582 (0x559be081d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #49: PyObject_Call + 0xbc (0x559be081df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x559be08042b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #51: _PyFunction_Vectorcall + 0x6c (0x559be0811a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x559be080a007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #53: _PyObject_Call_Prepend + 0x69 (0x559be081bc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #54: <unknown function> + 0x211239 (0x559be08de239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #55: PyObject_Call + 0x207 (0x559be081e067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x559be08042b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #57: <unknown function> + 0x150582 (0x559be081d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x559be08028fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #59: <unknown function> + 0x150582 (0x559be081d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #60: PyObject_Call + 0xbc (0x559be081df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x559be08042b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #62: <unknown function> + 0x150582 (0x559be081d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: frame #63: PyObject_Call + 0xbc (0x559be081df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default5]:[rank5]: . This may indicate a possible application crash on rank 0 or a network set up issue.
[default6]:[rank6]: Traceback (most recent call last):
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default6]:[rank6]: trainer.train(dataloader)
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train
[default6]:[rank6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step
[default6]:[rank6]: outputs = self.pipeline_engine.train_batch_iter(
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter
[default6]:[rank6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default6]:[rank6]: output = model(**micro_batch)
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default6]:[rank6]: return self._call_impl(*args, **kwargs)
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default6]:[rank6]: return forward_call(*args, **kwargs)
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward
[default6]:[rank6]: sharded_logits = self.model(
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default6]:[rank6]: return self._call_impl(*args, **kwargs)
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default6]:[rank6]: return forward_call(*args, **kwargs)
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default6]:[rank6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default6]:[rank6]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[default6]:[rank6]: return self._call_impl(*args, **kwargs)
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[default6]:[rank6]: return forward_call(*args, **kwargs)
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default6]:[rank6]: new_kwargs[name] = recv_from_pipeline_state_buffer(
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default6]:[rank6]: pipeline_state.run_communication()
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default6]:[rank6]: recv_activation_tensor = recv_activation()
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default6]:[rank6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default6]:[rank6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default6]:[rank6]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta
[default6]:[rank6]: dist.recv(
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[default6]:[rank6]: return func(*args, **kwargs)
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv
[default6]:[rank6]: pg.recv([tensor], group_src_rank, tag).wait()
[default6]:[rank6]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer
[default6]:[rank6]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
[default6]:[rank6]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4948705897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default6]:[rank6]: frame #1: <unknown function> + 0x5b3a23e (0x7f498222223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f498221cc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f498221cf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f498221dfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f49821d2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f49821d2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f49821d2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f49821d2371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f49499df189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:[rank6]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f49499e6610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:[rank6]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f4949a05978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:[rank6]: frame #12: <unknown function> + 0x5adc309 (0x7f49821c4309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #13: <unknown function> + 0x5ae6f10 (0x7f49821cef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f49821cefa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #15: <unknown function> + 0x5124446 (0x7f498180c446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #16: <unknown function> + 0x1acf4b8 (0x7f497e1b74b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #17: <unknown function> + 0x5aee004 (0x7f49821d6004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #18: <unknown function> + 0x5af36b5 (0x7f49821db6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default6]:[rank6]: frame #19: <unknown function> + 0xd2631e (0x7f4994dc531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default6]:[rank6]: frame #20: <unknown function> + 0x47def4 (0x7f499451cef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default6]:[rank6]: frame #21: <unknown function> + 0x1445a6 (0x55a31c6f05a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55a31c6e9a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #23: <unknown function> + 0x150866 (0x55a31c6fc866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55a31c6e5142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55a31c6f0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #26: PyObject_Call + 0xbc (0x55a31c6fcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55a31c6e32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55a31c6f0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55a31c6e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #30: <unknown function> + 0x150582 (0x55a31c6fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55a31c6e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #32: <unknown function> + 0x150582 (0x55a31c6fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55a31c6e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #34: <unknown function> + 0x150582 (0x55a31c6fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55a31c6e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55a31c6e8f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55a31c6fac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #38: <unknown function> + 0x211239 (0x55a31c7bd239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55a31c6e9a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55a31c6e53e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55a31c6f0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55a31c6e0c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55a31c6f0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55a31c6e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #45: <unknown function> + 0x150582 (0x55a31c6fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #46: PyObject_Call + 0xbc (0x55a31c6fcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55a31c6e32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #48: <unknown function> + 0x150582 (0x55a31c6fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #49: PyObject_Call + 0xbc (0x55a31c6fcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55a31c6e32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55a31c6f0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55a31c6e9007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55a31c6fac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #54: <unknown function> + 0x211239 (0x55a31c7bd239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #55: PyObject_Call + 0x207 (0x55a31c6fd067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55a31c6e32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #57: <unknown function> + 0x150582 (0x55a31c6fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55a31c6e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #59: <unknown function> + 0x150582 (0x55a31c6fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #60: PyObject_Call + 0xbc (0x55a31c6fcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55a31c6e32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #62: <unknown function> + 0x150582 (0x55a31c6fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: frame #63: PyObject_Call + 0xbc (0x55a31c6fcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default6]:[rank6]: . This may indicate a possible application crash on rank 0 or a network set up issue.
W0703 21:28:16.051000 140016841295680 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 222188 closing signal SIGTERM
W0703 21:28:16.051000 140016841295680 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 222189 closing signal SIGTERM
W0703 21:28:16.052000 140016841295680 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 222190 closing signal SIGTERM
W0703 21:28:16.052000 140016841295680 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 222191 closing signal SIGTERM
E0703 21:28:16.772000 140016841295680 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 222184) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-07-03_21:28:16
host : ip-26-0-174-36.ec2.internal
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 222185)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-07-03_21:28:16
host : ip-26-0-174-36.ec2.internal
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 222186)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-07-03_21:28:16
host : ip-26-0-174-36.ec2.internal
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 222187)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-03_21:28:16
host : ip-26-0-174-36.ec2.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 222184)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: ip-26-0-174-36: task 0: Exited with exit code 1
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.