|
======================== |
|
START TIME: Tue Jul 2 16:18:31 UTC 2024 |
|
python3 version = Python 3.10.14 |
|
======================== |
|
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well. |
|
Token is valid (permission: write). |
|
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token |
|
Login successful |
|
Already on 'bench_cluster' |
|
M examples/config_tiny_llama.py |
|
M examples/config_tiny_llama.yaml |
|
M examples/train_tiny_llama.sh |
|
M src/nanotron/models/llama.py |
|
M src/nanotron/trainer.py |
|
Your branch is up to date with 'origin/bench_cluster'. |
|
Job status: RUNNING |
|
W0702 16:18:34.298000 140256412628800 torch/distributed/run.py:757] |
|
W0702 16:18:34.298000 140256412628800 torch/distributed/run.py:757] ***************************************** |
|
W0702 16:18:34.298000 140256412628800 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
W0702 16:18:34.298000 140256412628800 torch/distributed/run.py:757] ***************************************** |
|
W0702 16:18:34.299000 140637734872896 torch/distributed/run.py:757] |
|
W0702 16:18:34.299000 140637734872896 torch/distributed/run.py:757] ***************************************** |
|
W0702 16:18:34.299000 140637734872896 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
W0702 16:18:34.299000 140637734872896 torch/distributed/run.py:757] ***************************************** |
|
[default0]:07/02/2024 16:18:52 [WARNING|DP=0|PP=0|TP=0|ip-26-0-171-62]: [Vocab Size Padding] Padded vocab (size: 50257) with 1 dummy tokens (new size: 50258) |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: Config: |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: Config(general=GeneralArgs(project='bench_cluster', |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: run='%date_%jobid', |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: seed=42, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: step=None, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: consumed_train_samples=None, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: benchmark_csv_path=None, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: ignore_sanity_checks=True), |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: parallelism=ParallelismArgs(dp=2, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: pp=4, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: tp=2, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: pp_engine=<nanotron.parallel.pipeline_parallel.engine.OneForwardOneBackwardPipelineEngine object at 0x7ff51fbd0910>, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: tp_mode=<TensorParallelLinearMode.REDUCE_SCATTER: 2>, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: tp_linear_async_communication=False, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: expert_parallel_size=1), |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: eos_token_id=2, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: hidden_act='silu', |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: hidden_size=2048, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: initializer_range=0.02, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: intermediate_size=4096, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: is_llama_config=True, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: max_position_embeddings=4096, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: num_attention_heads=32, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: num_hidden_layers=24, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: num_key_value_heads=32, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: pad_token_id=None, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: pretraining_tp=1, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: rms_norm_eps=1e-05, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: rope_scaling=None, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: rope_theta=10000.0, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: tie_word_embeddings=True, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: use_cache=True, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: vocab_size=50258), |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: init_method=RandomInit(std=0.025), |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: dtype=torch.bfloat16, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: make_vocab_size_divisible_by=1, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: ddp_bucket_cap_mb=25), |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2', |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: tokenizer_revision=None, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: tokenizer_max_length=None), |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'), |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: checkpoint_interval=100000, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: save_initial_state=False, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: resume_checkpoint_path=None, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: checkpoints_path_is_shared_file_system=False), |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: logging=LoggingArgs(log_level='info', |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: log_level_replica='info', |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: iteration_step_info_interval=1), |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: tokens=TokensArgs(sequence_length=4096, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: train_steps=20, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: micro_batch_size=64, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: batch_accumulation_per_replica=8, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: val_check_interval=-1, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: limit_val_batches=0, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: limit_test_batches=0), |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: adam_beta1=0.9, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: adam_beta2=0.95, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: torch_adam_is_fused=True, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: name='adamW'), |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: zero_stage=1, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: weight_decay=0.01, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: clip_grad=1.0, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: accumulate_grad_in_fp32=True, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: lr_warmup_steps=1, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: lr_warmup_style='linear', |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: lr_decay_style='linear', |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: lr_decay_steps=19, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: lr_decay_starting_step=None, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: min_decay_lr=1e-05)), |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: data_stages=[DatasetStageArgs(name='Training Stage', |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: start_training_step=1, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories', |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: hf_dataset_splits='train', |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: hf_dataset_config_name=None, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: dataset_processing_num_proc_per_process=64, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: dataset_overwrite_cache=False, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: text_column_name='text'), |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: seed=42, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: num_loading_workers=32))], |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/16_GPUS/dp-2_tp-2_pp-4_mbz-64')), |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: lighteval=None) |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: Model Config: |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: LlamaConfig(bos_token_id=1, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: eos_token_id=2, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: hidden_act='silu', |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: hidden_size=2048, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: initializer_range=0.02, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: intermediate_size=4096, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: is_llama_config=True, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: max_position_embeddings=4096, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: num_attention_heads=32, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: num_hidden_layers=24, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: num_key_value_heads=32, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: pad_token_id=None, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: pretraining_tp=1, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: rms_norm_eps=1e-05, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: rope_scaling=None, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: rope_theta=10000.0, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: tie_word_embeddings=True, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: use_cache=True, |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: vocab_size=50258) |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: Building model.. |
|
[default0]:07/02/2024 16:18:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: Setting PP block ranks... |
|
[default3]:07/02/2024 16:19:04 [INFO|DP=1|PP=2|TP=1|ip-26-0-171-88]: No checkpoint path provided. |
|
[default6]:07/02/2024 16:19:04 [INFO|DP=1|PP=3|TP=0|ip-26-0-171-88]: No checkpoint path provided. |
|
[default2]:07/02/2024 16:19:04 [INFO|DP=1|PP=2|TP=0|ip-26-0-171-88]: No checkpoint path provided. |
|
[default7]:07/02/2024 16:19:04 [INFO|DP=1|PP=3|TP=1|ip-26-0-171-88]: No checkpoint path provided. |
|
[default2]:07/02/2024 16:19:04 [INFO|DP=1|PP=0|TP=0|ip-26-0-171-62]: No checkpoint path provided. |
|
[default7]:07/02/2024 16:19:04 [INFO|DP=1|PP=1|TP=1|ip-26-0-171-62]: No checkpoint path provided. |
|
[default3]:07/02/2024 16:19:04 [INFO|DP=1|PP=0|TP=1|ip-26-0-171-62]: No checkpoint path provided. |
|
[default6]:07/02/2024 16:19:04 [INFO|DP=1|PP=1|TP=0|ip-26-0-171-62]: No checkpoint path provided. |
|
[default5]:07/02/2024 16:19:05 [INFO|DP=0|PP=1|TP=1|ip-26-0-171-62]: Local number of parameters: 147M (280.05MiB) |
|
[default1]:07/02/2024 16:19:05 [INFO|DP=0|PP=2|TP=1|ip-26-0-171-88]: Local number of parameters: 126M (240.05MiB) |
|
[default1]:07/02/2024 16:19:05 [INFO|DP=0|PP=2|TP=1|ip-26-0-171-88]: [After model building] Memory usage: 246.06MiB. Peak allocated: 248.09MiB Peak reserved: 262.00MiB |
|
[default1]:07/02/2024 16:19:05 [INFO|DP=0|PP=2|TP=1|ip-26-0-171-88]: No checkpoint path provided. |
|
[default4]:07/02/2024 16:19:05 [INFO|DP=0|PP=3|TP=0|ip-26-0-171-88]: Local number of parameters: 135M (258.20MiB) |
|
[default4]:07/02/2024 16:19:05 [INFO|DP=0|PP=3|TP=0|ip-26-0-171-88]: [After model building] Memory usage: 262.21MiB. Peak allocated: 264.24MiB Peak reserved: 280.00MiB |
|
[default4]:07/02/2024 16:19:05 [INFO|DP=0|PP=3|TP=0|ip-26-0-171-88]: No checkpoint path provided. |
|
[default5]:07/02/2024 16:19:05 [INFO|DP=0|PP=3|TP=1|ip-26-0-171-88]: Local number of parameters: 135M (258.20MiB) |
|
[default5]:07/02/2024 16:19:05 [INFO|DP=0|PP=3|TP=1|ip-26-0-171-88]: [After model building] Memory usage: 262.21MiB. Peak allocated: 264.24MiB Peak reserved: 280.00MiB |
|
[default5]:07/02/2024 16:19:05 [INFO|DP=0|PP=3|TP=1|ip-26-0-171-88]: No checkpoint path provided. |
|
[default0]:07/02/2024 16:19:05 [INFO|DP=0|PP=2|TP=0|ip-26-0-171-88]: Local number of parameters: 126M (240.05MiB) |
|
[default0]:07/02/2024 16:19:05 [INFO|DP=0|PP=2|TP=0|ip-26-0-171-88]: [After model building] Memory usage: 246.06MiB. Peak allocated: 248.09MiB Peak reserved: 262.00MiB |
|
[default0]:07/02/2024 16:19:05 [INFO|DP=0|PP=2|TP=0|ip-26-0-171-88]: No checkpoint path provided. |
|
[default1]:07/02/2024 16:19:05 [INFO|DP=0|PP=0|TP=1|ip-26-0-171-62]: Local number of parameters: 198M (378.21MiB) |
|
[default1]:07/02/2024 16:19:05 [INFO|DP=0|PP=0|TP=1|ip-26-0-171-62]: [After model building] Memory usage: 385.23MiB. Peak allocated: 387.26MiB Peak reserved: 402.00MiB |
|
[default4]:07/02/2024 16:19:05 [INFO|DP=0|PP=1|TP=0|ip-26-0-171-62]: Local number of parameters: 147M (280.05MiB) |
|
[default4]:07/02/2024 16:19:05 [INFO|DP=0|PP=1|TP=0|ip-26-0-171-62]: [After model building] Memory usage: 287.07MiB. Peak allocated: 289.10MiB Peak reserved: 302.00MiB |
|
[default1]:07/02/2024 16:19:05 [INFO|DP=0|PP=0|TP=1|ip-26-0-171-62]: No checkpoint path provided. |
|
[default4]:07/02/2024 16:19:05 [INFO|DP=0|PP=1|TP=0|ip-26-0-171-62]: No checkpoint path provided. |
|
[default0]:07/02/2024 16:19:05 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: Total number of parameters: 1.21G (2313.02MiB) |
|
[default0]:07/02/2024 16:19:05 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: Local number of parameters: 198M (378.21MiB) |
|
[default0]:07/02/2024 16:19:05 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: [After model building] Memory usage: 385.23MiB. Peak allocated: 387.26MiB Peak reserved: 402.00MiB |
|
[default0]:07/02/2024 16:19:05 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: No checkpoint path provided. |
|
[default0]:07/02/2024 16:19:05 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: Parametrizing model parameters using StandardParametrizator |
|
[default5]:07/02/2024 16:19:05 [INFO|DP=0|PP=1|TP=1|ip-26-0-171-62]: [After model building] Memory usage: 287.07MiB. Peak allocated: 289.10MiB Peak reserved: 302.00MiB |
|
[default5]:07/02/2024 16:19:05 [INFO|DP=0|PP=1|TP=1|ip-26-0-171-62]: No checkpoint path provided. |
|
[default0]:07/02/2024 16:19:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: [Optimizer Building] Using LearningRateForSP as learning rate |
|
[default0]:07/02/2024 16:19:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: [ZeRO sharding] Size of optimizer params per rank: |
|
[default0]:07/02/2024 16:19:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: [ZeRO sharding] DP Rank 0 has 99.1M out of 198M (50.00%) params' optimizer states |
|
[default0]:07/02/2024 16:19:07 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: [ZeRO sharding] DP Rank 1 has 99.1M out of 198M (50.00%) params' optimizer states |
|
[default0]:07/02/2024 16:19:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples |
|
[default0]:07/02/2024 16:19:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: Using `datasets` library |
|
[default0]:07/02/2024 16:19:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4') |
|
[default0]:07/02/2024 16:19:08 [WARNING|DP=0|PP=0|TP=0|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default0]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default0]:07/02/2024 16:19:09 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: [Training Plan] There are 1 training stages |
|
[default0]:07/02/2024 16:19:09 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: [Stage Training Stage] start from step 1 |
|
[default0]:07/02/2024 16:19:09 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: |
|
[default0]:07/02/2024 16:19:09 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: [Start training] datetime: 2024-07-02 16:19:09.495732 | mbs: 64 | grad_accum: 8 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0 |
|
[default0]:07/02/2024 16:19:09 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps |
|
[default0]:07/02/2024 16:19:09 [INFO|DP=0|PP=0|TP=0|ip-26-0-171-62]: Memory usage: 1519.87MiB. Peak allocated 1519.87MiB. Peak reserved: 1540.00MiB |
|
[default4]:07/02/2024 16:19:09 [WARNING|DP=0|PP=3|TP=0|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default1]:07/02/2024 16:19:09 [WARNING|DP=0|PP=2|TP=1|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default5]:07/02/2024 16:19:09 [WARNING|DP=0|PP=3|TP=1|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default4]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default2]:07/02/2024 16:19:09 [WARNING|DP=1|PP=2|TP=0|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default7]:07/02/2024 16:19:09 [WARNING|DP=1|PP=3|TP=1|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default0]:07/02/2024 16:19:09 [WARNING|DP=0|PP=2|TP=0|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default1]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default7]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default2]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default0]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default5]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default4]:07/02/2024 16:19:09 [WARNING|DP=0|PP=1|TP=0|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default1]:07/02/2024 16:19:09 [WARNING|DP=0|PP=0|TP=1|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default2]:07/02/2024 16:19:09 [WARNING|DP=1|PP=0|TP=0|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default7]:07/02/2024 16:19:09 [WARNING|DP=1|PP=1|TP=1|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default7]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default1]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default3]:07/02/2024 16:19:09 [WARNING|DP=1|PP=0|TP=1|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default5]:07/02/2024 16:19:09 [WARNING|DP=0|PP=1|TP=1|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default6]:07/02/2024 16:19:09 [WARNING|DP=1|PP=1|TP=0|ip-26-0-171-62]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default2]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default5]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default4]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default3]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default6]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default3]:07/02/2024 16:19:09 [WARNING|DP=1|PP=2|TP=1|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default6]:07/02/2024 16:19:09 [WARNING|DP=1|PP=3|TP=0|ip-26-0-171-88]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default6]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default3]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default2]:[rank2]: OSError: [Errno 122] Disk quota exceeded |
|
[default2]: |
|
[default2]:[rank2]: During handling of the above exception, another exception occurred: |
|
[default2]: |
|
[default2]:[rank2]: Traceback (most recent call last): |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default2]:[rank2]: trainer.train(dataloader) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default2]:[rank2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default2]:[rank2]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default2]:[rank2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default2]:[rank2]: output = model(**micro_batch) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default2]:[rank2]: return self._call_impl(*args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default2]:[rank2]: return forward_call(*args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default2]:[rank2]: sharded_logits = self.model( |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default2]:[rank2]: return self._call_impl(*args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default2]:[rank2]: return forward_call(*args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default2]:[rank2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default2]:[rank2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default2]:[rank2]: return self._call_impl(*args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default2]:[rank2]: return forward_call(*args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward |
|
[default2]:[rank2]: output = self.pp_block(**new_kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default2]:[rank2]: return self._call_impl(*args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default2]:[rank2]: return forward_call(*args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward |
|
[default2]:[rank2]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default2]:[rank2]: return self._call_impl(*args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default2]:[rank2]: return forward_call(*args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 566, in forward |
|
[default2]:[rank2]: query_states, key_value_states = self.flash_rotary_embedding(query_states, kv=key_value_states) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default2]:[rank2]: return self._call_impl(*args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default2]:[rank2]: return forward_call(*args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/flash_attn/layers/rotary.py", line 457, in forward |
|
[default2]:[rank2]: q = apply_rotary_emb_func( |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/flash_attn/layers/rotary.py", line 122, in apply_rotary_emb |
|
[default2]:[rank2]: return ApplyRotaryEmb.apply( |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply |
|
[default2]:[rank2]: return super().apply(*args, **kwargs) # type: ignore[misc] |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/flash_attn/layers/rotary.py", line 48, in forward |
|
[default2]:[rank2]: out = apply_rotary( |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/flash_attn/ops/triton/rotary.py", line 202, in apply_rotary |
|
[default2]:[rank2]: rotary_kernel[grid]( |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in <lambda> |
|
[default2]:[rank2]: return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/jit.py", line 416, in run |
|
[default2]:[rank2]: self.cache[device][key] = compile( |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/compiler/compiler.py", line 194, in compile |
|
[default2]:[rank2]: metadata_group[f"{src.name}.{ext}"] = fn_cache_manager.put(next_module, f"{src.name}.{ext}") |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/cache.py", line 123, in put |
|
[default2]:[rank2]: with open(temp_path, mode) as f: |
|
[default2]:[rank2]: OSError: [Errno 122] Disk quota exceeded |
|
[default1]:[rank1]: Traceback (most recent call last): |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default1]:[rank1]: trainer.train(dataloader) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default1]:[rank1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default1]:[rank1]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default1]:[rank1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default1]:[rank1]: output = model(**micro_batch) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank1]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank1]: return forward_call(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default1]:[rank1]: sharded_logits = self.model( |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank1]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank1]: return forward_call(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default1]:[rank1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default1]:[rank1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank1]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank1]: return forward_call(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward |
|
[default1]:[rank1]: output = self.pp_block(**new_kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank1]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank1]: return forward_call(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward |
|
[default1]:[rank1]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank1]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank1]: return forward_call(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 565, in forward |
|
[default1]:[rank1]: key_value_states = key_value_states.permute(1, 2, 0, 3, 4).contiguous() |
|
[default1]:[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU has a total capacity of 79.33 GiB of which 355.94 MiB is free. Including non-PyTorch memory, this process has 78.96 GiB memory in use. Of the allocated memory 70.62 GiB is allocated by PyTorch, and 1.42 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) |
|
[default0]:[rank0]: Traceback (most recent call last): |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default0]:[rank0]: trainer.train(dataloader) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default0]:[rank0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default0]:[rank0]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default0]:[rank0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default0]:[rank0]: output = model(**micro_batch) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank0]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank0]: return forward_call(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default0]:[rank0]: sharded_logits = self.model( |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank0]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank0]: return forward_call(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default0]:[rank0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default0]:[rank0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank0]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank0]: return forward_call(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward |
|
[default0]:[rank0]: output = self.pp_block(**new_kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank0]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank0]: return forward_call(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 631, in forward |
|
[default0]:[rank0]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank0]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank0]: return forward_call(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 565, in forward |
|
[default0]:[rank0]: key_value_states = key_value_states.permute(1, 2, 0, 3, 4).contiguous() |
|
[default0]:[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU |
|
[default4]:[rank12]: Traceback (most recent call last): |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default4]:[rank12]: trainer.train(dataloader) |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default4]:[rank12]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default4]:[rank12]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter |
|
[default4]:[rank12]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default4]:[rank12]: output = model(**micro_batch) |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default4]:[rank12]: return self._call_impl(*args, **kwargs) |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default4]:[rank12]: return forward_call(*args, **kwargs) |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default4]:[rank12]: sharded_logits = self.model( |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default4]:[rank12]: return self._call_impl(*args, **kwargs) |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default4]:[rank12]: return forward_call(*args, **kwargs) |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default4]:[rank12]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default4]:[rank12]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default4]:[rank12]: return self._call_impl(*args, **kwargs) |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default4]:[rank12]: return forward_call(*args, **kwargs) |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default4]:[rank12]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default4]:[rank12]: pipeline_state.run_communication() |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default4]:[rank12]: recv_activation_tensor = recv_activation() |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default4]:[rank12]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default4]:[rank12]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default4]:[rank12]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default4]:[rank12]: dist.recv( |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default4]:[rank12]: return func(*args, **kwargs) |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default4]:[rank12]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default4]:[rank12]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer |
|
[default4]:[rank12]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default4]:[rank12]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4fec464897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default4]:[rank12]: frame #1: <unknown function> + 0x5b3a23e (0x7f5025f8123e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank12]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f5025f7bc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank12]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f5025f7bf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank12]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f5025f7cfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank12]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5025f31371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank12]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5025f31371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank12]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5025f31371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank12]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f5025f31371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank12]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f4fed73e189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default4]:[rank12]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f4fed745610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default4]:[rank12]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f4fed764978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default4]:[rank12]: frame #12: <unknown function> + 0x5adc309 (0x7f5025f23309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank12]: frame #13: <unknown function> + 0x5ae6f10 (0x7f5025f2df10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank12]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f5025f2dfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank12]: frame #15: <unknown function> + 0x5124446 (0x7f502556b446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank12]: frame #16: <unknown function> + 0x1acf4b8 (0x7f5021f164b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank12]: frame #17: <unknown function> + 0x5aee004 (0x7f5025f35004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank12]: frame #18: <unknown function> + 0x5af36b5 (0x7f5025f3a6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank12]: frame #19: <unknown function> + 0xd2631e (0x7f5038b2431e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default4]:[rank12]: frame #20: <unknown function> + 0x47def4 (0x7f503827bef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default4]:[rank12]: frame #21: <unknown function> + 0x1445a6 (0x5585bedd85a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5585bedd1a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #23: <unknown function> + 0x150866 (0x5585bede4866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5585bedcd142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5585bedd8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #26: PyObject_Call + 0xbc (0x5585bede4f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5585bedcb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5585bedd8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5585bedc98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #30: <unknown function> + 0x150582 (0x5585bede4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5585bedc98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #32: <unknown function> + 0x150582 (0x5585bede4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5585bedc98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #34: <unknown function> + 0x150582 (0x5585bede4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5585bedc98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5585bedd0f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5585bede2c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #38: <unknown function> + 0x211239 (0x5585beea5239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5585bedd1a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5585bedcd3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5585bedd8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5585bedc8c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5585bedd8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5585bedc98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #45: <unknown function> + 0x150582 (0x5585bede4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #46: PyObject_Call + 0xbc (0x5585bede4f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5585bedcb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #48: <unknown function> + 0x150582 (0x5585bede4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #49: PyObject_Call + 0xbc (0x5585bede4f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5585bedcb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5585bedd8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5585bedd1007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5585bede2c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #54: <unknown function> + 0x211239 (0x5585beea5239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #55: PyObject_Call + 0x207 (0x5585bede5067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5585bedcb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #57: <unknown function> + 0x150582 (0x5585bede4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5585bedc98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #59: <unknown function> + 0x150582 (0x5585bede4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #60: PyObject_Call + 0xbc (0x5585bede4f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5585bedcb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #62: <unknown function> + 0x150582 (0x5585bede4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: frame #63: PyObject_Call + 0xbc (0x5585bede4f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank12]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default0]:[rank8]: Traceback (most recent call last): |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default0]:[rank8]: trainer.train(dataloader) |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default0]:[rank8]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default0]:[rank8]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default0]:[rank8]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default0]:[rank8]: output = model(**micro_batch) |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank8]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank8]: return forward_call(*args, **kwargs) |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default0]:[rank8]: sharded_logits = self.model( |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank8]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank8]: return forward_call(*args, **kwargs) |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default0]:[rank8]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default0]:[rank8]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank8]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default6]:[rank14]: Traceback (most recent call last): |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default6]:[rank14]: trainer.train(dataloader) |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default0]:[rank8]: return forward_call(*args, **kwargs) |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default6]:[rank14]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default0]:[rank8]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default0]:[rank8]: pipeline_state.run_communication() |
|
[default1]:[rank9]: Traceback (most recent call last): |
|
[default6]:[rank14]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter |
|
[default2]:[rank10]: Traceback (most recent call last): |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default6]:[rank14]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default0]:[rank8]: recv_activation_tensor = recv_activation() |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default0]:[rank8]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default1]:[rank9]: trainer.train(dataloader) |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default0]:[rank8]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default2]:[rank10]: trainer.train(dataloader) |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default2]:[rank10]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default2]:[rank10]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default2]:[rank10]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default2]:[rank10]: output = model(**micro_batch) |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default2]:[rank10]: return self._call_impl(*args, **kwargs) |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default2]:[rank10]: return forward_call(*args, **kwargs) |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default2]:[rank10]: sharded_logits = self.model( |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default2]:[rank10]: return self._call_impl(*args, **kwargs) |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default2]:[rank10]: return forward_call(*args, **kwargs) |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default2]:[rank10]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default2]:[rank10]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default2]:[rank10]: return self._call_impl(*args, **kwargs) |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default2]:[rank10]: return forward_call(*args, **kwargs) |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default2]:[rank10]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default2]:[rank10]: pipeline_state.run_communication() |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default2]:[rank10]: recv_activation_tensor = recv_activation() |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default2]:[rank10]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default2]:[rank10]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default2]:[rank10]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default2]:[rank10]: dist.recv( |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default2]:[rank10]: return func(*args, **kwargs) |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default2]:[rank10]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default7]:[rank7]: Traceback (most recent call last): |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default7]:[rank7]: trainer.train(dataloader) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default7]:[rank7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default7]:[rank7]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default7]:[rank7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default2]:[rank10]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer |
|
[default2]:[rank10]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default2]:[rank10]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0ee8834897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default2]:[rank10]: frame #1: <unknown function> + 0x5b3a23e (0x7f0f2235123e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank10]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f0f2234bc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default7]:[rank7]: output = model(**micro_batch) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default7]:[rank7]: return self._call_impl(*args, **kwargs) |
|
[default2]:[rank10]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f0f2234bf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank10]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f0f2234cfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank10]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0f22301371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank10]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0f22301371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank10]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0f22301371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default7]:[rank7]: return forward_call(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default7]:[rank7]: sharded_logits = self.model( |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default7]:[rank7]: return self._call_impl(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default7]:[rank7]: return forward_call(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default2]:[rank10]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0f22301371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank10]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f0ee9b0e189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default2]:[rank10]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f0ee9b15610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default2]:[rank10]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f0ee9b34978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default7]:[rank7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default7]:[rank7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default7]:[rank7]: return self._call_impl(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default7]:[rank7]: return forward_call(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default7]:[rank7]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default2]:[rank10]: frame #12: <unknown function> + 0x5adc309 (0x7f0f222f3309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank10]: frame #13: <unknown function> + 0x5ae6f10 (0x7f0f222fdf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank10]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f0f222fdfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank10]: frame #15: <unknown function> + 0x5124446 (0x7f0f2193b446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank10]: frame #16: <unknown function> + 0x1acf4b8 (0x7f0f1e2e64b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default7]:[rank7]: pipeline_state.run_communication() |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default7]:[rank7]: recv_activation_tensor = recv_activation() |
|
[default2]:[rank10]: frame #17: <unknown function> + 0x5aee004 (0x7f0f22305004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank10]: frame #18: <unknown function> + 0x5af36b5 (0x7f0f2230a6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank10]: frame #19: <unknown function> + 0xd2631e (0x7f0f34ef431e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default2]:[rank10]: frame #20: <unknown function> + 0x47def4 (0x7f0f3464bef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default2]:[rank10]: frame #21: <unknown function> + 0x1445a6 (0x55b3516545a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55b35164da6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default7]:[rank7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default7]:[rank7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default7]:[rank7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default7]:[rank7]: dist.recv( |
|
[default2]:[rank10]: frame #23: <unknown function> + 0x150866 (0x55b351660866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b351649142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55b351654a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #26: PyObject_Call + 0xbc (0x55b351660f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b3516472b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55b351654a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b3516458fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default7]:[rank7]: return func(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default7]:[rank7]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default7]:[rank7]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer |
|
[default7]:[rank7]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default7]:[rank7]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f522543c897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default2]:[rank10]: frame #30: <unknown function> + 0x150582 (0x55b351660582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b3516458fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #32: <unknown function> + 0x150582 (0x55b351660582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b3516458fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #34: <unknown function> + 0x150582 (0x55b351660582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b3516458fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b35164cf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #1: <unknown function> + 0x5b3a23e (0x7f525ef5923e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f525ef53c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f525ef53f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f525ef54fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f525ef09371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank10]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55b35165ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #38: <unknown function> + 0x211239 (0x55b351721239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55b35164da6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b3516493e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55b351654a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b351644c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55b351654a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f525ef09371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f525ef09371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f525ef09371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f5226716189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default2]:[rank10]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b3516458fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #45: <unknown function> + 0x150582 (0x55b351660582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #46: PyObject_Call + 0xbc (0x55b351660f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b3516472b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #48: <unknown function> + 0x150582 (0x55b351660582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #49: PyObject_Call + 0xbc (0x55b351660f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b3516472b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f522671d610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default7]:[rank7]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f522673c978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default7]:[rank7]: frame #12: <unknown function> + 0x5adc309 (0x7f525eefb309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #13: <unknown function> + 0x5ae6f10 (0x7f525ef05f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f525ef05fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank10]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55b351654a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b35164d007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55b35165ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #54: <unknown function> + 0x211239 (0x55b351721239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #55: PyObject_Call + 0x207 (0x55b351661067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b3516472b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #15: <unknown function> + 0x5124446 (0x7f525e543446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #16: <unknown function> + 0x1acf4b8 (0x7f525aeee4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #17: <unknown function> + 0x5aee004 (0x7f525ef0d004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #18: <unknown function> + 0x5af36b5 (0x7f525ef126b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #19: <unknown function> + 0xd2631e (0x7f5271afc31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default2]:[rank10]: frame #57: <unknown function> + 0x150582 (0x55b351660582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b3516458fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #59: <unknown function> + 0x150582 (0x55b351660582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #60: PyObject_Call + 0xbc (0x55b351660f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b3516472b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #62: <unknown function> + 0x150582 (0x55b351660582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: frame #63: PyObject_Call + 0xbc (0x55b351660f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank10]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default7]:[rank7]: frame #20: <unknown function> + 0x47def4 (0x7f5271253ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default7]:[rank7]: frame #21: <unknown function> + 0x1445a6 (0x55b55302a5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: output = model(**micro_batch) |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default6]:[rank14]: return self._call_impl(*args, **kwargs) |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default6]:[rank14]: return forward_call(*args, **kwargs) |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default6]:[rank14]: sharded_logits = self.model( |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default6]:[rank14]: return self._call_impl(*args, **kwargs) |
|
[default7]:[rank7]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55b553023a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #23: <unknown function> + 0x150866 (0x55b553036866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b55301f142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55b55302aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #26: PyObject_Call + 0xbc (0x55b553036f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b55301d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default6]:[rank14]: return forward_call(*args, **kwargs) |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default6]:[rank14]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default6]:[rank14]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default6]:[rank14]: return self._call_impl(*args, **kwargs) |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default7]:[rank7]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55b55302aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b55301b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #30: <unknown function> + 0x150582 (0x55b553036582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b55301b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #32: <unknown function> + 0x150582 (0x55b553036582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b55301b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: return forward_call(*args, **kwargs) |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default6]:[rank14]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default6]:[rank14]: pipeline_state.run_communication() |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default6]:[rank14]: recv_activation_tensor = recv_activation() |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default6]:[rank14]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default7]:[rank7]: frame #34: <unknown function> + 0x150582 (0x55b553036582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b55301b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b553022f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55b553034c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #38: <unknown function> + 0x211239 (0x55b5530f7239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55b553023a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default6]:[rank14]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default6]:[rank14]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default6]:[rank14]: dist.recv( |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default6]:[rank14]: return func(*args, **kwargs) |
|
[default7]:[rank7]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b55301f3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packag |
|
[default7]:[rank7]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55b55302aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b55301ac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55b55302aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b55301b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #45: <unknown function> + 0x150582 (0x55b553036582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #46: PyObject_Call + 0xbc (0x55b553036f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b55301d2b3 in /fsx/ferdinandmom/miniforge3/es/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default6]:[rank14]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default6]:[rank14]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer |
|
[default6]:[rank14]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default6]:[rank14]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f46a49b5897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default6]:[rank14]: frame #1: <unknown function> + 0x5b3a23e (0x7f46de4d223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank14]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #48: <unknown function> + 0x150582 (0x55b553036582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #49: PyObject_Call + 0xbc (0x55b553036f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b55301d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55b55302aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
0x2c7 (0x7f46de4ccc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank14]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f46de4ccf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank14]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f46de4cdfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank14]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f46de482371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank14]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f46de482371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank14]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f46de482371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b553023007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55b553034c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #54: <unknown function> + 0x211239 (0x55b5530f7239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #55: PyObject_Call + 0x207 (0x55b553037067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b55301d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #57: <unknown function> + 0x150582 (0x55b553036582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b55301b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f46de482371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank14]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f46a5c8f189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default6]:[rank14]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f46a5c96610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default6]:[rank14]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f46a5cb5978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default7]:[rank7]: frame #59: <unknown function> + 0x150582 (0x55b553036582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #60: PyObject_Call + 0xbc (0x55b553036f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b55301d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #12: <unknown function> + 0x5adc309 (0x7f46de474309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank14]: frame #13: <unknown function> + 0x5ae6f10 (0x7f46de47ef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank14]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f46de47efa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank14]: frame #15: <unknown function> + 0x5124446 (0x7f46ddabc446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank14]: frame #16: <unknown function> + 0x1acf4b8 (0x7f46da4674b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #62: <unknown function> + 0x150582 (0x55b553036582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #63: PyObject_Call + 0xbc (0x55b553036f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default6]:[rank14]: frame #17: <unknown function> + 0x5aee004 (0x7f46de486004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank14]: frame #18: <unknown function> + 0x5af36b5 (0x7f46de48b6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank14]: frame #19: <unknown function> + 0xd2631e (0x7f46f107531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default6]:[rank14]: frame #20: <unknown function> + 0x47def4 (0x7f46f07ccef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default6]:[rank14]: frame #21: <unknown function> + 0x1445a6 (0x555f8456c5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #22: _PyObject_MakeTpCall + 0x26b (0x555f84565a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #23: <unknown function> + 0x150866 (0x555f84578866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x555f84561142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #25: _PyFunction_Vectorcall + 0x6c (0x555f8456ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #26: PyObject_Call + 0xbc (0x555f84578f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x555f8455f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #28: _PyFunction_Vectorcall + 0x6c (0x555f8456ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x555f8455d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #30: <unknown function> + 0x150582 (0x555f84578582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x555f8455d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #32: <unknown function> + 0x150582 (0x555f84578582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x555f8455d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #34: <unknown function> + 0x150582 (0x555f84578582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x555f8455d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x555f84564f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #37: _PyObject_Call_Prepend + 0x69 (0x555f84576c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #38: <unknown function> + 0x211239 (0x555f84639239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #39: _PyObject_MakeTpCall + 0x26b (0x555f84565a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x555f845613e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #41: _PyFunction_Vectorcall + 0x6c (0x555f8456ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x555f8455cc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #43: _PyFunction_Vectorcall + 0x6c (0x555f8456ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x555f8455d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #45: <unknown function> + 0x150582 (0x555f84578582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #46: PyObject_Call + 0xbc (0x555f84578f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x555f8455f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #48: <unknown function> + 0x150582 (0x555f84578582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #49: PyObject_Call + 0xbc (0x555f84578f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x555f8455f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #51: _PyFunction_Vectorcall + 0x6c (0x555f8456ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x555f84565007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #53: _PyObject_Call_Prepend + 0x69 (0x555f84576c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #54: <unknown function> + 0x211239 (0x555f84639239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #55: PyObject_Call + 0x207 (0x555f84579067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x555f8455f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #57: <unknown function> + 0x150582 (0x555f84578582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x555f8455d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #59: <unknown function> + 0x150582 (0x555f84578582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #60: PyObject_Call + 0xbc (0x555f84578f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x555f8455f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #62: <unknown function> + 0x150582 (0x555f84578582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: frame #63: PyObject_Call + 0xbc (0x555f84578f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank14]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default1]:[rank9]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default1]:[rank9]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default1]:[rank9]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default1]:[rank9]: output = model(**micro_batch) |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank9]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank9]: return forward_call(*args, **kwargs) |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default1]:[rank9]: sharded_logits = self.model( |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank9]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank9]: return forward_call(*args, **kwargs) |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default1]:[rank9]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default1]:[rank9]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank9]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank9]: return forward_call(*args, **kwargs) |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default1]:[rank9]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default1]:[rank9]: pipeline_state.run_communication() |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default1]:[rank9]: recv_activation_tensor = recv_activation() |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default1]:[rank9]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default1]:[rank9]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default1]:[rank9]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default1]:[rank9]: dist.recv( |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default1]:[rank9]: return func(*args, **kwargs) |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default1]:[rank9]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default1]:[rank9]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer |
|
[default1]:[rank9]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default1]:[rank9]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4c06d35897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default1]:[rank9]: frame #1: <unknown function> + 0x5b3a23e (0x7f4c4085223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default1]:[rank9]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f4c4084cc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default1]:[rank9]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f4c4084cf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default1]:[rank9]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f4c4084dfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default1]:[rank9]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4c40802371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default1]:[rank9]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4c40802371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default1]:[rank9]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4c40802371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default1]:[rank9]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4c40802371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default1]:[rank9]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f4c0800f189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default1]:[rank9]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f4c08016610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default1]:[rank9]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f4c08035978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default1]:[rank9]: frame #12: <unknown function> + 0x5adc309 (0x7f4c407f4309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default1]:[rank9]: frame #13: <unknown function> + 0x5ae6f10 (0x7f4c407fef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default1]:[rank9]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f4c407fefa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default1]:[rank9]: frame #15: <unknown function> + 0x5124446 (0x7f4c3fe3c446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default1]:[rank9]: frame #16: <unknown function> + 0x1acf4b8 (0x7f4c3c7e74b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default1]:[rank9]: frame #17: <unknown function> + 0x5aee004 (0x7f4c40806004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default1]:[rank9]: frame #18: <unknown function> + 0x5af36b5 (0x7f4c4080b6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default1]:[rank9]: frame #19: <unknown function> + 0xd2631e (0x7f4c533f531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default1]:[rank9]: frame #20: <unknown function> + 0x47def4 (0x7f4c52b4cef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default1]:[rank9]: frame #21: <unknown function> + 0x1445a6 (0x5601274665a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56012745fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #23: <unknown function> + 0x150866 (0x560127472866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56012745b142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #25: _PyFunction_Vectorcall + 0x6c (0x560127466a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #26: PyObject_Call + 0xbc (0x560127472f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5601274592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #28: _PyFunction_Vectorcall + 0x6c (0x560127466a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5601274578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #30: <unknown function> + 0x150582 (0x560127472582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5601274578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #32: <unknown function> + 0x150582 (0x560127472582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5601274578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #34: <unknown function> + 0x150582 (0x560127472582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5601274578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56012745ef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #37: _PyObject_Call_Prepend + 0x69 (0x560127470c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #38: <unknown function> + 0x211239 (0x560127533239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #39: _PyObject_MakeTpCall + 0x26b (0x56012745fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56012745b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #41: _PyFunction_Vectorcall + 0x6c (0x560127466a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x560127456c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #43: _PyFunction_Vectorcall + 0x6c (0x560127466a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5601274578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #45: <unknown function> + 0x150582 (0x560127472582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #46: PyObject_Call + 0xbc (0x560127472f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5601274592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #48: <unknown function> + 0x150582 (0x560127472582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #49: PyObject_Call + 0xbc (0x560127472f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5601274592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #51: _PyFunction_Vectorcall + 0x6c (0x560127466a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56012745f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #53: _PyObject_Call_Prepend + 0x69 (0x560127470c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #54: <unknown function> + 0x211239 (0x560127533239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #55: PyObject_Call + 0x207 (0x560127473067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5601274592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #57: <unknown function> + 0x150582 (0x560127472582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5601274578fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #59: <unknown function> + 0x150582 (0x560127472582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #60: PyObject_Call + 0xbc (0x560127472f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5601274592b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #62: <unknown function> + 0x150582 (0x560127472582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: frame #63: PyObject_Call + 0xbc (0x560127472f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default1]:[rank9]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default0]:[rank8]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default0]:[rank8]: dist.recv( |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default0]:[rank8]: return func(*args, **kwargs) |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default0]:[rank8]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default0]:[rank8]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer |
|
[default0]:[rank8]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default0]:[rank8]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2aca8e0897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default0]:[rank8]: frame #1: <unknown function> + 0x5b3a23e (0x7f2b043fd23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default0]:[rank8]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f2b043f7c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default0]:[rank8]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f2b043f7f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default0]:[rank8]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f2b043f8fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default0]:[rank8]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2b043ad371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default0]:[rank8]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2b043ad371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default0]:[rank8]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2b043ad371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default0]:[rank8]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2b043ad371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default0]:[rank8]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f2acbbba189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default0]:[rank8]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f2acbbc1610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default0]:[rank8]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f2acbbe0978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default0]:[rank8]: frame #12: <unknown function> + 0x5adc309 (0x7f2b0439f309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default0]:[rank8]: frame #13: <unknown function> + 0x5ae6f10 (0x7f2b043a9f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default0]:[rank8]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f2b043a9fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default0]:[rank8]: frame #15: <unknown function> + 0x5124446 (0x7f2b039e7446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default0]:[rank8]: frame #16: <unknown function> + 0x1acf4b8 (0x7f2b003924b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default0]:[rank8]: frame #17: <unknown function> + 0x5aee004 (0x7f2b043b1004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default0]:[rank8]: frame #18: <unknown function> + 0x5af36b5 (0x7f2b043b66b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default0]:[rank8]: frame #19: <unknown function> + 0xd2631e (0x7f2b16fa031e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default0]:[rank8]: frame #20: <unknown function> + 0x47def4 (0x7f2b166f7ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default0]:[rank8]: frame #21: <unknown function> + 0x1445a6 (0x5614367865a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56143677fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #23: <unknown function> + 0x150866 (0x561436792866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56143677b142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #25: _PyFunction_Vectorcall + 0x6c (0x561436786a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #26: PyObject_Call + 0xbc (0x561436792f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5614367792b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #28: _PyFunction_Vectorcall + 0x6c (0x561436786a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5614367778fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #30: <unknown function> + 0x150582 (0x561436792582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5614367778fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #32: <unknown function> + 0x150582 (0x561436792582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5614367778fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #34: <unknown function> + 0x150582 (0x561436792582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5614367778fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56143677ef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #37: _PyObject_Call_Prepend + 0x69 (0x561436790c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #38: <unknown function> + 0x211239 (0x561436853239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #39: _PyObject_MakeTpCall + 0x26b (0x56143677fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56143677b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #41: _PyFunction_Vectorcall + 0x6c (0x561436786a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x561436776c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #43: _PyFunction_Vectorcall + 0x6c (0x561436786a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5614367778fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #45: <unknown function> + 0x150582 (0x561436792582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #46: PyObject_Call + 0xbc (0x561436792f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5614367792b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #48: <unknown function> + 0x150582 (0x561436792582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #49: PyObject_Call + 0xbc (0x561436792f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5614367792b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #51: _PyFunction_Vectorcall + 0x6c (0x561436786a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: Traceback (most recent call last): |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default5]:[rank13]: trainer.train(dataloader) |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default5]:[rank13]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default5]:[rank13]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter |
|
[default5]:[rank13]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default5]:[rank13]: output = model(**micro_batch) |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default5]:[rank13]: return self._call_impl(*args, **kwargs) |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default5]:[rank13]: return forward_call(*args, **kwargs) |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default5]:[rank13]: sharded_logits = self.model( |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default5]:[rank13]: return self._call_impl(*args, **kwargs) |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default5]:[rank13]: return forward_call(*args, **kwargs) |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default5]:[rank13]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default5]:[rank13]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default5]:[rank13]: return self._call_impl(*args, **kwargs) |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default5]:[rank13]: return forward_call(*args, **kwargs) |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default5]:[rank13]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default5]:[rank13]: pipeline_state.run_communication() |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default5]:[rank13]: recv_activation_tensor = recv_activation() |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default5]:[rank13]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default5]:[rank13]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default5]:[rank13]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default5]:[rank13]: dist.recv( |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default5]:[rank13]: return func(*args, **kwargs) |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default5]:[rank13]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default5]:[rank13]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer |
|
[default5]:[rank13]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default5]:[rank13]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48e30c8897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default5]:[rank13]: frame #1: <unknown function> + 0x5b3a23e (0x7f491cbe523e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank13]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f491cbdfc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: Traceback (most recent call last): |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default3]:[rank11]: trainer.train(dataloader) |
|
[default5]:[rank13]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f491cbdff82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank13]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f491cbe0fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank13]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f491cb95371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank13]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f491cb95371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank13]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f491cb95371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank13]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f491cb95371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank13]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f48e43a2189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default5]:[rank13]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f48e43a9610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default5]:[rank13]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f48e43c8978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default5]:[rank13]: frame #12: <unknown function> + 0x5adc309 (0x7f491cb87309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank13]: frame #13: <unknown function> + 0x5ae6f10 (0x7f491cb91f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default3]:[rank11]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default3]:[rank11]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default5]:[rank13]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f491cb91fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank13]: frame #15: <unknown function> + 0x5124446 (0x7f491c1cf446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank13]: frame #16: <unknown function> + 0x1acf4b8 (0x7f4918b7a4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank13]: frame #17: <unknown function> + 0x5aee004 (0x7f491cb99004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank13]: frame #18: <unknown function> + 0x5af36b5 (0x7f491cb9e6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default3]:[rank11]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default5]:[rank13]: frame #19: <unknown function> + 0xd2631e (0x7f492f78831e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default5]:[rank13]: frame #20: <unknown function> + 0x47def4 (0x7f492eedfef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default3]:[rank11]: output = model(**micro_batch) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default3]:[rank11]: return self._call_impl(*args, **kwargs) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default3]:[rank11]: return forward_call(*args, **kwargs) |
|
[default5]:[rank13]: frame #21: <unknown function> + 0x1445a6 (0x55a4990e65a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55a4990dfa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default3]:[rank11]: sharded_logits = self.model( |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default3]:[rank11]: return self._call_impl(*args, **kwargs) |
|
[default5]:[rank13]: frame #23: <unknown function> + 0x150866 (0x55a4990f2866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55a4990db142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default3]:[rank11]: return forward_call(*args, **kwargs) |
|
[default5]:[rank13]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55a4990e6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #26: PyObject_Call + 0xbc (0x55a4990f2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default3]:[rank11]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default5]:[rank13]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55a4990d92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55a4990e6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55a4990d78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default3]:[rank11]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default5]:[rank13]: frame #30: <unknown function> + 0x150582 (0x55a4990f2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55a4990d78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #32: <unknown function> + 0x150582 (0x55a4990f2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55a4990d78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #34: <unknown function> + 0x150582 (0x55a4990f2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55a4990d78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: return self._call_impl(*args, **kwargs) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default3]:[rank11]: return forward_call(*args, **kwargs) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default3]:[rank11]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default5]:[rank13]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55a4990def50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55a4990f0c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default3]:[rank11]: pipeline_state.run_communication() |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default3]:[rank11]: recv_activation_tensor = recv_activation() |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default3]:[rank11]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default5]:[rank13]: frame #38: <unknown function> + 0x211239 (0x55a4991b3239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55a4990dfa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55a4990db3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55a4990e6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55a4990d6c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default3]:[rank11]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default3]:[rank11]: dist.recv( |
|
[default5]:[rank13]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55a4990e6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: Traceback (most recent call last): |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default4]:[rank4]: trainer.train(dataloader) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default4]:[rank4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default4]:[rank4]: output = model(**micro_batch) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default4]:[rank4]: return self._call_impl(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default4]:[rank4]: return forward_call(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default4]:[rank4]: sharded_logits = self.model( |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default4]:[rank4]: return self._call_impl(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default4]:[rank4]: return forward_call(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default4]:[rank4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default4]:[rank4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default4]:[rank4]: return self._call_impl(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default4]:[rank4]: return forward_call(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default4]:[rank4]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default4]:[rank4]: pipeline_state.run_communication() |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default4]:[rank4]: recv_activation_tensor = recv_activation() |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default4]:[rank4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default4]:[rank4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default4]:[rank4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default4]:[rank4]: dist.recv( |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default4]:[rank4]: return func(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default4]:[rank4]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default4]:[rank4]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer |
|
[default4]:[rank4]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default4]:[rank4]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8006496897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default4]:[rank4]: frame #1: <unknown function> + 0x5b3a23e (0x7f803ffb323e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f803ffadc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f803ffadf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f803ffaefd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f803ff63371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f803ff63371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f803ff63371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f803ff63371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f8007770189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default4]:[rank4]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f8007777610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default4]:[rank4]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f8007796978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default4]:[rank4]: frame #12: <unknown function> + 0x5adc309 (0x7f803ff55309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #13: <unknown function> + 0x5ae6f10 (0x7f803ff5ff10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f803ff5ffa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #15: <unknown function> + 0x5124446 (0x7f803f59d446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #16: <unknown function> + 0x1acf4b8 (0x7f803bf484b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #17: <unknown function> + 0x5aee004 (0x7f803ff67004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #18: <unknown function> + 0x5af36b5 (0x7f803ff6c6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #19: <unknown function> + 0xd2631e (0x7f8052b5631e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default4]:[rank4]: frame #20: <unknown function> + 0x47def4 (0x7f80522adef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default4]:[rank4]: frame #21: <unknown function> + 0x1445a6 (0x55616f6695a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55616f662a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #23: <unknown function> + 0x150866 (0x55616f675866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55616f65e142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55616f669a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #26: PyObject_Call + 0xbc (0x55616f675f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55616f65c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55616f669a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55616f65a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #30: <unknown function> + 0x150582 (0x55616f675582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55616f65a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #32: <unknown function> + 0x150582 (0x55616f675582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55616f65a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #34: <unknown function> + 0x150582 (0x55616f675582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55616f65a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55616f661f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55616f673c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #38: <unknown function> + 0x211239 (0x55616f736239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55616f662a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55616f65e3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55616f669a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55616f659c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55616f669a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55616f65a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #45: <unknown function> + 0x150582 (0x55616f675582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #46: PyObject_Call + 0xbc (0x55616f675f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55616f65c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #48: <unknown function> + 0x150582 (0x55616f675582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #49: PyObject_Call + 0xbc (0x55616f675f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55616f65c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55616f669a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55616f662007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55616f673c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #54: <unknown function> + 0x211239 (0x55616f736239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #55: PyObject_Call + 0x207 (0x55616f676067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55616f65c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #57: <unknown function> + 0x150582 (0x55616f675582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55616f65a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #59: <unknown function> + 0x150582 (0x55616f675582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #60: PyObject_Call + 0xbc (0x55616f675f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55616f65c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #62: <unknown function> + 0x150582 (0x55616f675582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #63: PyObject_Call + 0xbc (0x55616f675f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default6]:[rank6]: Traceback (most recent call last): |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default6]:[rank6]: trainer.train(dataloader) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default6]:[rank6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default6]:[rank6]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default6]:[rank6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default6]:[rank6]: output = model(**micro_batch) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default6]:[rank6]: return self._call_impl(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default6]:[rank6]: return forward_call(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default6]:[rank6]: sharded_logits = self.model( |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default6]:[rank6]: return self._call_impl(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default6]:[rank6]: return forward_call(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default6]:[rank6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default6]:[rank6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default6]:[rank6]: return self._call_impl(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default6]:[rank6]: return forward_call(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default6]:[rank6]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default6]:[rank6]: pipeline_state.run_communication() |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default6]:[rank6]: recv_activation_tensor = recv_activation() |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default6]:[rank6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default6]:[rank6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default6]:[rank6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default6]:[rank6]: dist.recv( |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default6]:[rank6]: return func(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default6]:[rank6]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default6]:[rank6]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer |
|
[default6]:[rank6]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default6]:[rank6]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fce378f7897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default6]:[rank6]: frame #1: <unknown function> + 0x5b3a23e (0x7fce7141423e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7fce7140ec87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fce7140ef82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fce7140ffd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fce713c4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fce713c4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fce713c4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fce713c4371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fce38bd1189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default6]:[rank6]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fce38bd8610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default6]:[rank6]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7fce38bf7978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default6]:[rank6]: frame #12: <unknown function> + 0x5adc309 (0x7fce713b6309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #13: <unknown function> + 0x5ae6f10 (0x7fce713c0f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #14: <unknown function> + 0x5ae6fa5 (0x7fce713c0fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #15: <unknown function> + 0x5124446 (0x7fce709fe446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #16: <unknown function> + 0x1acf4b8 (0x7fce6d3a94b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #17: <unknown function> + 0x5aee004 (0x7fce713c8004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #18: <unknown function> + 0x5af36b5 (0x7fce713cd6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #19: <unknown function> + 0xd2631e (0x7fce83fb731e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default6]:[rank6]: frame #20: <unknown function> + 0x47def4 (0x7fce8370eef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default6]:[rank6]: frame #21: <unknown function> + 0x1445a6 (0x561013b155a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #22: _PyObject_MakeTpCall + 0x26b (0x561013b0ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #23: <unknown function> + 0x150866 (0x561013b21866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x561013b0a142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #25: _PyFunction_Vectorcall + 0x6c (0x561013b15a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #26: PyObject_Call + 0xbc (0x561013b21f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x561013b082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #28: _PyFunction_Vectorcall + 0x6c (0x561013b15a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x561013b068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #30: <unknown function> + 0x150582 (0x561013b21582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x561013b068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #32: <unknown function> + 0x150582 (0x561013b21582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x561013b068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #34: <unknown function> + 0x150582 (0x561013b21582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x561013b068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x561013b0df50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #37: _PyObject_Call_Prepend + 0x69 (0x561013b1fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #38: <unknown function> + 0x211239 (0x561013be2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #39: _PyObject_MakeTpCall + 0x26b (0x561013b0ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x561013b0a3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #41: _PyFunction_Vectorcall + 0x6c (0x561013b15a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x561013b05c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #43: _PyFunction_Vectorcall + 0x6c (0x561013b15a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x561013b068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #45: <unknown function> + 0x150582 (0x561013b21582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #46: PyObject_Call + 0xbc (0x561013b21f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x561013b082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #48: <unknown function> + 0x150582 (0x561013b21582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #49: PyObject_Call + 0xbc (0x561013b21f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x561013b082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #51: _PyFunction_Vectorcall + 0x6c (0x561013b15a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x561013b0e007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #53: _PyObject_Call_Prepend + 0x69 (0x561013b1fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #54: <unknown function> + 0x211239 (0x561013be2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #55: PyObject_Call + 0x207 (0x561013b22067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x561013b082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #57: <unknown function> + 0x150582 (0x561013b21582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x561013b068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #59: <unknown function> + 0x150582 (0x561013b21582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #60: PyObject_Call + 0xbc (0x561013b21f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x561013b082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #62: <unknown function> + 0x150582 (0x561013b21582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #63: PyObject_Call + 0xbc (0x561013b21f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default5]:[rank13]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55a4990d78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56143677f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #53: _PyObject_Call_Prepend + 0x69 (0x561436790c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #54: <unknown function> + 0x211239 (0x561436853239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #55: PyObject_Call + 0x207 (0x561436793067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5614367792b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #57: <unknown function> + 0x150582 (0x561436792582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5614367778fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #59: <unknown function> + 0x150582 (0x561436792582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #60: PyObject_Call + 0xbc (0x561436792f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5614367792b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #62: <unknown function> + 0x150582 (0x561436792582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: frame #63: PyObject_Call + 0xbc (0x561436792f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default0]:[rank8]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default3]:[rank11]: return func(*args, **kwargs) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default3]:[rank11]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default3]:[rank11]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer |
|
[default3]:[rank11]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default3]:[rank11]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7b85951897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default3]:[rank11]: frame #1: <unknown function> + 0x5b3a23e (0x7f7bbf46e23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f7bbf468c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f7bbf468f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f7bbf469fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7bbf41e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7bbf41e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7bbf41e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7bbf41e371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f7b86c2b189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default3]:[rank11]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f7b86c32610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default3]:[rank11]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f7b86c51978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default3]:[rank11]: frame #12: <unknown function> + 0x5adc309 (0x7f7bbf410309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: frame #13: <unknown function> + 0x5ae6f10 (0x7f7bbf41af10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f7bbf41afa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: frame #15: <unknown function> + 0x5124446 (0x7f7bbea58446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: frame #16: <unknown function> + 0x1acf4b8 (0x7f7bbb4034b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: frame #17: <unknown function> + 0x5aee004 (0x7f7bbf422004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: frame #18: <unknown function> + 0x5af36b5 (0x7f7bbf4276b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank11]: frame #19: <unknown function> + 0xd2631e (0x7f7bd201131e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default3]:[rank11]: frame #20: <unknown function> + 0x47def4 (0x7f7bd1768ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default3]:[rank11]: frame #21: <unknown function> + 0x1445a6 (0x56235cd215a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56235cd1aa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #23: <unknown function> + 0x150866 (0x56235cd2d866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56235cd16142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #25: _PyFunction_Vectorcall + 0x6c (0x56235cd21a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #26: PyObject_Call + 0xbc (0x56235cd2df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x56235cd142b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #28: _PyFunction_Vectorcall + 0x6c (0x56235cd21a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x56235cd128fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #30: <unknown function> + 0x150582 (0x56235cd2d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x56235cd128fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #32: <unknown function> + 0x150582 (0x56235cd2d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x56235cd128fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #34: <unknown function> + 0x150582 (0x56235cd2d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x56235cd128fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56235cd19f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #37: _PyObject_Call_Prepend + 0x69 (0x56235cd2bc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #38: <unknown function> + 0x211239 (0x56235cdee239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #39: _PyObject_MakeTpCall + 0x26b (0x56235cd1aa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56235cd163e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #41: _PyFunction_Vectorcall + 0x6c (0x56235cd21a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x56235cd11c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #43: _PyFunction_Vectorcall + 0x6c (0x56235cd21a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x56235cd128fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #45: <unknown function> + 0x150582 (0x56235cd2d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #46: PyObject_Call + 0xbc (0x56235cd2df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x56235cd142b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #48: <unknown function> + 0x150582 (0x56235cd2d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #49: PyObject_Call + 0xbc (0x56235cd2df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x56235cd142b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #51: _PyFunction_Vectorcall + 0x6c (0x56235cd21a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56235cd1a007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #53: _PyObject_Call_Prepend + 0x69 (0x56235cd2bc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #54: <unknown function> + 0x211239 (0x56235cdee239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #55: PyObject_Call + 0x207 (0x56235cd2e067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x56235cd142b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #57: <unknown function> + 0x150582 (0x56235cd2d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x56235cd128fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #59: <unknown function> + 0x150582 (0x56235cd2d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #60: PyObject_Call + 0xbc (0x56235cd2df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x56235cd142b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #62: <unknown function> + 0x150582 (0x56235cd2d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: frame #63: PyObject_Call + 0xbc (0x56235cd2df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank11]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default5]:[rank13]: frame #45: <unknown function> + 0x150582 (0x55a4990f2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #46: PyObject_Call + 0xbc (0x55a4990f2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55a4990d92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #48: <unknown function> + 0x150582 (0x55a4990f2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #49: PyObject_Call + 0xbc (0x55a4990f2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55a4990d92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55a4990e6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55a4990df007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55a4990f0c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #54: <unknown function> + 0x211239 (0x55a4991b3239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #55: PyObject_Call + 0x207 (0x55a4990f3067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55a4990d92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #57: <unknown function> + 0x150582 (0x55a4990f2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55a4990d78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #59: <unknown function> + 0x150582 (0x55a4990f2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #60: PyObject_Call + 0xbc (0x55a4990f2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55a4990d92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #62: <unknown function> + 0x150582 (0x55a4990f2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: frame #63: PyObject_Call + 0xbc (0x55a4990f2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank13]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default7]:[rank15]: Traceback (most recent call last): |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default7]:[rank15]: trainer.train(dataloader) |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default7]:[rank15]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default7]:[rank15]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter |
|
[default7]:[rank15]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default7]:[rank15]: output = model(**micro_batch) |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default7]:[rank15]: return self._call_impl(*args, **kwargs) |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default7]:[rank15]: return forward_call(*args, **kwargs) |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default7]:[rank15]: sharded_logits = self.model( |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default7]:[rank15]: return self._call_impl(*args, **kwargs) |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default7]:[rank15]: return forward_call(*args, **kwargs) |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default7]:[rank15]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default7]:[rank15]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default7]:[rank15]: return self._call_impl(*args, **kwargs) |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default7]:[rank15]: return forward_call(*args, **kwargs) |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default7]:[rank15]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default7]:[rank15]: pipeline_state.run_communication() |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default7]:[rank15]: recv_activation_tensor = recv_activation() |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default7]:[rank15]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default7]:[rank15]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default7]:[rank15]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default7]:[rank15]: dist.recv( |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default7]:[rank15]: return func(*args, **kwargs) |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default7]:[rank15]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default7]:[rank15]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer |
|
[default7]:[rank15]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default7]:[rank15]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f706aab4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default7]:[rank15]: frame #1: <unknown function> + 0x5b3a23e (0x7f70a45d123e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank15]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f70a45cbc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank15]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f70a45cbf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank15]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f70a45ccfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank15]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f70a4581371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank15]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f70a4581371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank15]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f70a4581371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank15]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f70a4581371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank15]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f706bd8e189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default7]:[rank15]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f706bd95610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default7]:[rank15]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f706bdb4978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default7]:[rank15]: frame #12: <unknown function> + 0x5adc309 (0x7f70a4573309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank15]: frame #13: <unknown function> + 0x5ae6f10 (0x7f70a457df10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank15]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f70a457dfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank15]: frame #15: <unknown function> + 0x5124446 (0x7f70a3bbb446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank15]: frame #16: <unknown function> + 0x1acf4b8 (0x7f70a05664b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank15]: frame #17: <unknown function> + 0x5aee004 (0x7f70a4585004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank15]: frame #18: <unknown function> + 0x5af36b5 (0x7f70a458a6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank15]: frame #19: <unknown function> + 0xd2631e (0x7f70b717431e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default7]:[rank15]: frame #20: <unknown function> + 0x47def4 (0x7f70b68cbef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default7]:[rank15]: frame #21: <unknown function> + 0x1445a6 (0x56118d23a5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56118d233a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #23: <unknown function> + 0x150866 (0x56118d246866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56118d22f142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #25: _PyFunction_Vectorcall + 0x6c (0x56118d23aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #26: PyObject_Call + 0xbc (0x56118d246f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x56118d22d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #28: _PyFunction_Vectorcall + 0x6c (0x56118d23aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x56118d22b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #30: <unknown function> + 0x150582 (0x56118d246582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x56118d22b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #32: <unknown function> + 0x150582 (0x56118d246582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x56118d22b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #34: <unknown function> + 0x150582 (0x56118d246582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x56118d22b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56118d232f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #37: _PyObject_Call_Prepend + 0x69 (0x56118d244c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #38: <unknown function> + 0x211239 (0x56118d307239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #39: _PyObject_MakeTpCall + 0x26b (0x56118d233a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56118d22f3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #41: _PyFunction_Vectorcall + 0x6c (0x56118d23aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x56118d22ac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #43: _PyFunction_Vectorcall + 0x6c (0x56118d23aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x56118d22b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #45: <unknown function> + 0x150582 (0x56118d246582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #46: PyObject_Call + 0xbc (0x56118d246f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x56118d22d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #48: <unknown function> + 0x150582 (0x56118d246582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #49: PyObject_Call + 0xbc (0x56118d246f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x56118d22d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #51: _PyFunction_Vectorcall + 0x6c (0x56118d23aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56118d233007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #53: _PyObject_Call_Prepend + 0x69 (0x56118d244c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #54: <unknown function> + 0x211239 (0x56118d307239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #55: PyObject_Call + 0x207 (0x56118d247067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x56118d22d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #57: <unknown function> + 0x150582 (0x56118d246582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x56118d22b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #59: <unknown function> + 0x150582 (0x56118d246582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #60: PyObject_Call + 0xbc (0x56118d246f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x56118d22d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #62: <unknown function> + 0x150582 (0x56118d246582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: frame #63: PyObject_Call + 0xbc (0x56118d246f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank15]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default5]:[rank5]: Traceback (most recent call last): |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default5]:[rank5]: trainer.train(dataloader) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default5]:[rank5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default5]:[rank5]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default5]:[rank5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default5]:[rank5]: output = model(**micro_batch) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default5]:[rank5]: return self._call_impl(*args, **kwargs) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default5]:[rank5]: return forward_call(*args, **kwargs) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default5]:[rank5]: sharded_logits = self.model( |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default5]:[rank5]: return self._call_impl(*args, **kwargs) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default5]:[rank5]: return forward_call(*args, **kwargs) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default5]:[rank5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default5]:[rank5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default5]:[rank5]: return self._call_impl(*args, **kwargs) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default5]:[rank5]: return forward_call(*args, **kwargs) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default5]:[rank5]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default5]:[rank5]: pipeline_state.run_communication() |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default5]:[rank5]: recv_activation_tensor = recv_activation() |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default5]:[rank5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default5]:[rank5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default5]:[rank5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default5]:[rank5]: dist.recv( |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default5]:[rank5]: return func(*args, **kwargs) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default5]:[rank5]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default5]:[rank5]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer |
|
[default5]:[rank5]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default5]:[rank5]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f32da5d2897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default5]:[rank5]: frame #1: <unknown function> + 0x5b3a23e (0x7f33140ef23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f33140e9c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f33140e9f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f33140eafd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f331409f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f331409f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f331409f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f331409f371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f32db8ac189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default5]:[rank5]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f32db8b3610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default5]:[rank5]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f32db8d2978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default5]:[rank5]: frame #12: <unknown function> + 0x5adc309 (0x7f3314091309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #13: <unknown function> + 0x5ae6f10 (0x7f331409bf10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f331409bfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #15: <unknown function> + 0x5124446 (0x7f33136d9446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #16: <unknown function> + 0x1acf4b8 (0x7f33100844b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #17: <unknown function> + 0x5aee004 (0x7f33140a3004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #18: <unknown function> + 0x5af36b5 (0x7f33140a86b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #19: <unknown function> + 0xd2631e (0x7f3326c9231e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default5]:[rank5]: frame #20: <unknown function> + 0x47def4 (0x7f33263e9ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default5]:[rank5]: frame #21: <unknown function> + 0x1445a6 (0x556fd60445a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #22: _PyObject_MakeTpCall + 0x26b (0x556fd603da6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #23: <unknown function> + 0x150866 (0x556fd6050866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x556fd6039142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #25: _PyFunction_Vectorcall + 0x6c (0x556fd6044a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #26: PyObject_Call + 0xbc (0x556fd6050f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x556fd60372b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #28: _PyFunction_Vectorcall + 0x6c (0x556fd6044a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x556fd60358fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #30: <unknown function> + 0x150582 (0x556fd6050582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x556fd60358fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #32: <unknown function> + 0x150582 (0x556fd6050582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x556fd60358fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #34: <unknown function> + 0x150582 (0x556fd6050582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x556fd60358fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x556fd603cf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #37: _PyObject_Call_Prepend + 0x69 (0x556fd604ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #38: <unknown function> + 0x211239 (0x556fd6111239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #39: _PyObject_MakeTpCall + 0x26b (0x556fd603da6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x556fd60393e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #41: _PyFunction_Vectorcall + 0x6c (0x556fd6044a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x556fd6034c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #43: _PyFunction_Vectorcall + 0x6c (0x556fd6044a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x556fd60358fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #45: <unknown function> + 0x150582 (0x556fd6050582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #46: PyObject_Call + 0xbc (0x556fd6050f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x556fd60372b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #48: <unknown function> + 0x150582 (0x556fd6050582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #49: PyObject_Call + 0xbc (0x556fd6050f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x556fd60372b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #51: _PyFunction_Vectorcall + 0x6c (0x556fd6044a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x556fd603d007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #53: _PyObject_Call_Prepend + 0x69 (0x556fd604ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #54: <unknown function> + 0x211239 (0x556fd6111239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #55: PyObject_Call + 0x207 (0x556fd6051067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x556fd60372b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #57: <unknown function> + 0x150582 (0x556fd6050582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x556fd60358fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #59: <unknown function> + 0x150582 (0x556fd6050582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #60: PyObject_Call + 0xbc (0x556fd6050f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x556fd60372b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #62: <unknown function> + 0x150582 (0x556fd6050582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #63: PyObject_Call + 0xbc (0x556fd6050f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
W0702 16:19:15.426000 140256412628800 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3266932 closing signal SIGTERM |
|
W0702 16:19:15.426000 140256412628800 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3266933 closing signal SIGTERM |
|
W0702 16:19:15.426000 140256412628800 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3266934 closing signal SIGTERM |
|
W0702 16:19:15.430000 140256412628800 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3266935 closing signal SIGTERM |
|
W0702 16:19:15.431000 140256412628800 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3266936 closing signal SIGTERM |
|
W0702 16:19:15.432000 140256412628800 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3266937 closing signal SIGTERM |
|
W0702 16:19:15.433000 140256412628800 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3266938 closing signal SIGTERM |
|
E0702 16:19:17.641000 140256412628800 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 3266931) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 |
|
Traceback (most recent call last): |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module> |
|
sys.exit(main()) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper |
|
return f(*args, **kwargs) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main |
|
run(args) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run |
|
elastic_launch( |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
|
return launch_agent(self._config, self._entrypoint, list(args)) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent |
|
raise ChildFailedError( |
|
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
============================================================ |
|
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED |
|
------------------------------------------------------------ |
|
Failures: |
|
<NO_OTHER_FAILURES> |
|
------------------------------------------------------------ |
|
Root Cause (first observed failure): |
|
[0]: |
|
time : 2024-07-02_16:19:15 |
|
host : ip-26-0-171-62.ec2.internal |
|
rank : 0 (local_rank: 0) |
|
exitcode : 1 (pid: 3266931) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
============================================================ |
|
srun: error: ip-26-0-171-62: task 0: Exited with exit code 1 |
|
W0702 16:19:20.399000 140632068052736 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-171-88.ec2.internal_256273_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
W0702 16:19:20.426000 140637734872896 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 256343 closing signal SIGTERM |
|
W0702 16:19:20.427000 140637734872896 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 256344 closing signal SIGTERM |
|
W0702 16:19:20.427000 140637734872896 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 256346 closing signal SIGTERM |
|
W0702 16:19:20.427000 140637734872896 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 256347 closing signal SIGTERM |
|
W0702 16:19:20.427000 140637734872896 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 256348 closing signal SIGTERM |
|
W0702 16:19:20.428000 140637734872896 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 256349 closing signal SIGTERM |
|
E0702 16:19:21.445000 140637734872896 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 256342) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 |
|
W0702 16:19:21.451000 140637734872896 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-88.ec2.internal_256273_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
W0702 16:19:21.478000 140637734872896 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-88.ec2.internal_256273_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
W0702 16:19:21.489000 140637734872896 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-171-88.ec2.internal_256273_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
Traceback (most recent call last): |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module> |
|
sys.exit(main()) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper |
|
return f(*args, **kwargs) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main |
|
run(args) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run |
|
elastic_launch( |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
|
return launch_agent(self._config, self._entrypoint, list(args)) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent |
|
raise ChildFailedError( |
|
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
============================================================ |
|
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED |
|
------------------------------------------------------------ |
|
Failures: |
|
[1]: |
|
time : 2024-07-02_16:19:20 |
|
host : ip-26-0-171-88.ec2.internal |
|
rank : 11 (local_rank: 3) |
|
exitcode : 1 (pid: 256345) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
------------------------------------------------------------ |
|
Root Cause (first observed failure): |
|
[0]: |
|
time : 2024-07-02_16:19:20 |
|
host : ip-26-0-171-88.ec2.internal |
|
rank : 8 (local_rank: 0) |
|
exitcode : 1 (pid: 256342) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
============================================================ |
|
srun: error: ip-26-0-171-88: task 1: Exited with exit code 1 |
|
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details. |
|
|