|
======================== |
|
START TIME: Wed Jul 3 22:45:55 UTC 2024 |
|
python3 version = Python 3.10.14 |
|
======================== |
|
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well. |
|
Token is valid (permission: write). |
|
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token |
|
Login successful |
|
Already on 'bench_cluster' |
|
M examples/config_tiny_llama.py |
|
M examples/config_tiny_llama.yaml |
|
M examples/train_tiny_llama.sh |
|
M src/nanotron/models/llama.py |
|
M src/nanotron/trainer.py |
|
Your branch is up to date with 'origin/bench_cluster'. |
|
Job status: RUNNING |
|
W0703 22:45:58.088000 139809641432896 torch/distributed/run.py:757] |
|
W0703 22:45:58.088000 139809641432896 torch/distributed/run.py:757] ***************************************** |
|
W0703 22:45:58.088000 139809641432896 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
W0703 22:45:58.088000 139809641432896 torch/distributed/run.py:757] ***************************************** |
|
[default0]:07/03/2024 22:46:14 [WARNING|DP=0|PP=0|TP=0|ip-26-0-161-178]: [Vocab Size Padding] Padded vocab (size: 50257) with 1 dummy tokens (new size: 50258) |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Config: |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Config(general=GeneralArgs(project='bench_cluster', |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: run='%date_%jobid', |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: seed=42, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: step=None, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: consumed_train_samples=None, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: benchmark_csv_path=None, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: ignore_sanity_checks=True), |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: parallelism=ParallelismArgs(dp=1, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pp=4, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tp=2, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pp_engine=<nanotron.parallel.pipeline_parallel.engine.OneForwardOneBackwardPipelineEngine object at 0x7fa72bcf0730>, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tp_mode=<TensorParallelLinearMode.REDUCE_SCATTER: 2>, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tp_linear_async_communication=False, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: expert_parallel_size=1), |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: eos_token_id=2, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hidden_act='silu', |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hidden_size=2048, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: initializer_range=0.02, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: intermediate_size=4096, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: is_llama_config=True, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: max_position_embeddings=4096, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_attention_heads=32, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_hidden_layers=24, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_key_value_heads=32, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pad_token_id=None, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pretraining_tp=1, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rms_norm_eps=1e-05, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rope_scaling=None, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rope_theta=10000.0, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tie_word_embeddings=True, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: use_cache=True, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: vocab_size=50258), |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: init_method=RandomInit(std=0.025), |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: dtype=torch.bfloat16, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: make_vocab_size_divisible_by=1, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: ddp_bucket_cap_mb=25), |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2', |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tokenizer_revision=None, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tokenizer_max_length=None), |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'), |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: checkpoint_interval=100000, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: save_initial_state=False, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: resume_checkpoint_path=None, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: checkpoints_path_is_shared_file_system=False), |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: logging=LoggingArgs(log_level='info', |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: log_level_replica='info', |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: iteration_step_info_interval=1), |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tokens=TokensArgs(sequence_length=4096, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: train_steps=20, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: micro_batch_size=32, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: batch_accumulation_per_replica=32, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: val_check_interval=-1, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: limit_val_batches=0, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: limit_test_batches=0), |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: adam_beta1=0.9, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: adam_beta2=0.95, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: torch_adam_is_fused=True, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: name='adamW'), |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: zero_stage=1, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: weight_decay=0.01, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: clip_grad=1.0, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: accumulate_grad_in_fp32=True, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lr_warmup_steps=1, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lr_warmup_style='linear', |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lr_decay_style='linear', |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lr_decay_steps=19, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lr_decay_starting_step=None, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: min_decay_lr=1e-05)), |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: data_stages=[DatasetStageArgs(name='Training Stage', |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: start_training_step=1, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories', |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hf_dataset_splits='train', |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hf_dataset_config_name=None, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: dataset_processing_num_proc_per_process=64, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: dataset_overwrite_cache=False, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: text_column_name='text'), |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: seed=42, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_loading_workers=0))], |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/8_GPUS/dp-1_tp-2_pp-4_mbz-32')), |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lighteval=None) |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Model Config: |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: LlamaConfig(bos_token_id=1, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: eos_token_id=2, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hidden_act='silu', |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hidden_size=2048, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: initializer_range=0.02, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: intermediate_size=4096, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: is_llama_config=True, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: max_position_embeddings=4096, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_attention_heads=32, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_hidden_layers=24, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_key_value_heads=32, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pad_token_id=None, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pretraining_tp=1, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rms_norm_eps=1e-05, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rope_scaling=None, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rope_theta=10000.0, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tie_word_embeddings=True, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: use_cache=True, |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: vocab_size=50258) |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Building model.. |
|
[default0]:07/03/2024 22:46:14 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Setting PP block ranks... |
|
[default5]:07/03/2024 22:46:28 [INFO|DP=0|PP=2|TP=1|ip-26-0-161-178]: Local number of parameters: 126M (240.05MiB) |
|
[default5]:07/03/2024 22:46:28 [INFO|DP=0|PP=2|TP=1|ip-26-0-161-178]: [After model building] Memory usage: 246.06MiB. Peak allocated: 248.09MiB Peak reserved: 262.00MiB |
|
[default5]:07/03/2024 22:46:28 [INFO|DP=0|PP=2|TP=1|ip-26-0-161-178]: No checkpoint path provided. |
|
[default1]:07/03/2024 22:46:28 [INFO|DP=0|PP=0|TP=1|ip-26-0-161-178]: Local number of parameters: 198M (378.21MiB) |
|
[default1]:07/03/2024 22:46:28 [INFO|DP=0|PP=0|TP=1|ip-26-0-161-178]: [After model building] Memory usage: 385.23MiB. Peak allocated: 387.26MiB Peak reserved: 402.00MiB |
|
[default1]:07/03/2024 22:46:28 [INFO|DP=0|PP=0|TP=1|ip-26-0-161-178]: No checkpoint path provided. |
|
[default0]:07/03/2024 22:46:28 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Total number of parameters: 1.21G (2313.02MiB) |
|
[default0]:07/03/2024 22:46:28 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Local number of parameters: 198M (378.21MiB) |
|
[default0]:07/03/2024 22:46:28 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [After model building] Memory usage: 385.23MiB. Peak allocated: 387.26MiB Peak reserved: 402.00MiB |
|
[default3]:07/03/2024 22:46:28 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-178]: Local number of parameters: 147M (280.05MiB) |
|
[default3]:07/03/2024 22:46:28 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-178]: [After model building] Memory usage: 287.07MiB. Peak allocated: 289.10MiB Peak reserved: 302.00MiB |
|
[default3]:07/03/2024 22:46:28 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-178]: No checkpoint path provided. |
|
[default0]:07/03/2024 22:46:28 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: No checkpoint path provided. |
|
[default0]:07/03/2024 22:46:28 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Parametrizing model parameters using StandardParametrizator |
|
[default2]:07/03/2024 22:46:28 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-178]: Local number of parameters: 147M (280.05MiB) |
|
[default2]:07/03/2024 22:46:28 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-178]: [After model building] Memory usage: 287.07MiB. Peak allocated: 289.10MiB Peak reserved: 302.00MiB |
|
[default2]:07/03/2024 22:46:28 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-178]: No checkpoint path provided. |
|
[default7]:07/03/2024 22:46:28 [INFO|DP=0|PP=3|TP=1|ip-26-0-161-178]: Local number of parameters: 135M (258.20MiB) |
|
[default7]:07/03/2024 22:46:28 [INFO|DP=0|PP=3|TP=1|ip-26-0-161-178]: [After model building] Memory usage: 262.21MiB. Peak allocated: 264.24MiB Peak reserved: 280.00MiB |
|
[default7]:07/03/2024 22:46:28 [INFO|DP=0|PP=3|TP=1|ip-26-0-161-178]: No checkpoint path provided. |
|
[default4]:07/03/2024 22:46:28 [INFO|DP=0|PP=2|TP=0|ip-26-0-161-178]: Local number of parameters: 126M (240.05MiB) |
|
[default4]:07/03/2024 22:46:28 [INFO|DP=0|PP=2|TP=0|ip-26-0-161-178]: [After model building] Memory usage: 246.06MiB. Peak allocated: 248.09MiB Peak reserved: 262.00MiB |
|
[default4]:07/03/2024 22:46:28 [INFO|DP=0|PP=2|TP=0|ip-26-0-161-178]: No checkpoint path provided. |
|
[default6]:07/03/2024 22:46:28 [INFO|DP=0|PP=3|TP=0|ip-26-0-161-178]: Local number of parameters: 135M (258.20MiB) |
|
[default6]:07/03/2024 22:46:28 [INFO|DP=0|PP=3|TP=0|ip-26-0-161-178]: [After model building] Memory usage: 262.21MiB. Peak allocated: 264.24MiB Peak reserved: 280.00MiB |
|
[default6]:07/03/2024 22:46:28 [INFO|DP=0|PP=3|TP=0|ip-26-0-161-178]: No checkpoint path provided. |
|
[default0]:07/03/2024 22:46:29 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [Optimizer Building] Using LearningRateForSP as learning rate |
|
[default0]:07/03/2024 22:46:29 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [ZeRO sharding] Size of optimizer params per rank: |
|
[default0]:07/03/2024 22:46:29 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [ZeRO sharding] DP Rank 0 has 198M out of 198M (100.00%) params' optimizer states |
|
[default0]:07/03/2024 22:46:30 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples |
|
[default0]:07/03/2024 22:46:30 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Using `datasets` library |
|
[default0]:07/03/2024 22:46:30 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4') |
|
[default0]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default0]:07/03/2024 22:46:30 [WARNING|DP=0|PP=0|TP=0|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default0]:07/03/2024 22:46:31 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [Training Plan] There are 1 training stages |
|
[default0]:07/03/2024 22:46:31 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [Stage Training Stage] start from step 1 |
|
[default0]:07/03/2024 22:46:31 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: |
|
[default0]:07/03/2024 22:46:31 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [Start training] datetime: 2024-07-03 22:46:31.601454 | mbs: 32 | grad_accum: 32 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0 |
|
[default0]:07/03/2024 22:46:31 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps |
|
[default0]:07/03/2024 22:46:31 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Memory usage: 1898.09MiB. Peak allocated 1898.09MiB. Peak reserved: 1918.00MiB |
|
[default3]:07/03/2024 22:46:31 [WARNING|DP=0|PP=1|TP=1|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default2]:07/03/2024 22:46:31 [WARNING|DP=0|PP=1|TP=0|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default4]:07/03/2024 22:46:31 [WARNING|DP=0|PP=2|TP=0|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default3]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default2]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default4]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default5]:07/03/2024 22:46:31 [WARNING|DP=0|PP=2|TP=1|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default1]:07/03/2024 22:46:31 [WARNING|DP=0|PP=0|TP=1|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default1]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default7]:07/03/2024 22:46:31 [WARNING|DP=0|PP=3|TP=1|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default5]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default6]:07/03/2024 22:46:31 [WARNING|DP=0|PP=3|TP=0|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default7]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default6]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default0]:[rank0]: Traceback (most recent call last): |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default0]:[rank0]: trainer.train(dataloader) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default0]:[rank0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default0]:[rank0]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default0]:[rank0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default0]:[rank0]: output = model(**micro_batch) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank0]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank0]: return forward_call(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default0]:[rank0]: sharded_logits = self.model( |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank0]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank0]: return forward_call(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default0]:[rank0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default0]:[rank0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank0]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank0]: return forward_call(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward |
|
[default0]:[rank0]: output = self.pp_block(**new_kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank0]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank0]: return forward_call(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward |
|
[default0]:[rank0]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank0]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank0]: return forward_call(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward |
|
[default0]:[rank0]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank0]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank0]: return forward_call(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward |
|
[default0]:[rank0]: return row_linear( |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear |
|
[default0]:[rank0]: out = F.linear(input, weight, bias) |
|
[default0]:[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU |
|
[default1]:[rank1]: Traceback (most recent call last): |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default1]:[rank1]: trainer.train(dataloader) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default1]:[rank1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default1]:[rank1]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default1]:[rank1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default1]:[rank1]: output = model(**micro_batch) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank1]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank1]: return forward_call(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default1]:[rank1]: sharded_logits = self.model( |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank1]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank1]: return forward_call(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default1]:[rank1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default1]:[rank1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank1]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank1]: return forward_call(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward |
|
[default1]:[rank1]: output = self.pp_block(**new_kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank1]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank1]: return forward_call(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward |
|
[default1]:[rank1]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank1]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank1]: return forward_call(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward |
|
[default1]:[rank1]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank1]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank1]: return forward_call(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward |
|
[default1]:[rank1]: return row_linear( |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear |
|
[default1]:[rank1]: out = F.linear(input, weight, bias) |
|
[default1]:[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU has a total capacity of 79.33 GiB of which 189.94 MiB is free. Including non-PyTorch memory, this process has 79.13 GiB memory in use. Of the allocated memory 65.49 GiB is allocated by PyTorch, and 690.90 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) |
|
[default5]:[rank5]: Traceback (most recent call last): |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default5]:[rank5]: trainer.train(dataloader) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default5]:[rank5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default5]:[rank5]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default5]:[rank5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default5]:[rank5]: output = model(**micro_batch) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default5]:[rank5]: return self._call_impl(*args, **kwargs) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default5]:[rank5]: return forward_call(*args, **kwargs) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default5]:[rank5]: sharded_logits = self.model( |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default5]:[rank5]: return self._call_impl(*args, **kwargs) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default5]:[rank5]: return forward_call(*args, **kwargs) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default5]:[rank5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default5]:[rank5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default5]:[rank5]: return self._call_impl(*args, **kwargs) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default5]:[rank5]: return forward_call(*args, **kwargs) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default5]:[rank5]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default5]:[rank5]: pipeline_state.run_communication() |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default5]:[rank5]: recv_activation_tensor = recv_activation() |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default5]:[rank5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default5]:[rank5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default5]:[rank5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default5]:[rank5]: dist.recv( |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default5]:[rank5]: return func(*args, **kwargs) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default5]:[rank5]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default5]:[rank5]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer |
|
[default5]:[rank5]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default5]:[rank5]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4124b6a897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default5]:[rank5]: frame #1: <unknown function> + 0x5b3a23e (0x7f415e68723e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f415e681c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f415e681f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f415e682fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f415e637371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f415e637371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f415e637371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f415e637371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f4125e44189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default5]:[rank5]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f4125e4b610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default5]:[rank5]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f4125e6a978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default5]:[rank5]: frame #12: <unknown function> + 0x5adc309 (0x7f415e629309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #13: <unknown function> + 0x5ae6f10 (0x7f415e633f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f415e633fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #15: <unknown function> + 0x5124446 (0x7f415dc71446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #16: <unknown function> + 0x1acf4b8 (0x7f415a61c4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #17: <unknown function> + 0x5aee004 (0x7f415e63b004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #18: <unknown function> + 0x5af36b5 (0x7f415e6406b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #19: <unknown function> + 0xd2631e (0x7f417122a31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default5]:[rank5]: frame #20: <unknown function> + 0x47def4 (0x7f4170981ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default5]:[rank5]: frame #21: <unknown function> + 0x1445a6 (0x555d5fee15a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #22: _PyObject_MakeTpCall + 0x26b (0x555d5fedaa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #23: <unknown function> + 0x150866 (0x555d5feed866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x555d5fed6142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #25: _PyFunction_Vectorcall + 0x6c (0x555d5fee1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #26: PyObject_Call + 0xbc (0x555d5feedf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x555d5fed42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #28: _PyFunction_Vectorcall + 0x6c (0x555d5fee1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x555d5fed28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #30: <unknown function> + 0x150582 (0x555d5feed582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x555d5fed28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #32: <unknown function> + 0x150582 (0x555d5feed582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x555d5fed28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #34: <unknown function> + 0x150582 (0x555d5feed582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x555d5fed28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x555d5fed9f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #37: _PyObject_Call_Prepend + 0x69 (0x555d5feebc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #38: <unknown function> + 0x211239 (0x555d5ffae239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #39: _PyObject_MakeTpCall + 0x26b (0x555d5fedaa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x555d5fed63e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #41: _PyFunction_Vectorcall + 0x6c (0x555d5fee1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x555d5fed1c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #43: _PyFunction_Vectorcall + 0x6c (0x555d5fee1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x555d5fed28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #45: <unknown function> + 0x150582 (0x555d5feed582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #46: PyObject_Call + 0xbc (0x555d5feedf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x555d5fed42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #48: <unknown function> + 0x150582 (0x555d5feed582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #49: PyObject_Call + 0xbc (0x555d5feedf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x555d5fed42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #51: _PyFunction_Vectorcall + 0x6c (0x555d5fee1a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x555d5feda007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #53: _PyObject_Call_Prepend + 0x69 (0x555d5feebc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #54: <unknown function> + 0x211239 (0x555d5ffae239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #55: PyObject_Call + 0x207 (0x555d5feee067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x555d5fed42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #57: <unknown function> + 0x150582 (0x555d5feed582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x555d5fed28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #59: <unknown function> + 0x150582 (0x555d5feed582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #60: PyObject_Call + 0xbc (0x555d5feedf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x555d5fed42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #62: <unknown function> + 0x150582 (0x555d5feed582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #63: PyObject_Call + 0xbc (0x555d5feedf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default7]:[rank7]: Traceback (most recent call last): |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default7]:[rank7]: trainer.train(dataloader) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default7]:[rank7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default7]:[rank7]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter |
|
[default7]:[rank7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default7]:[rank7]: output = model(**micro_batch) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default7]:[rank7]: return self._call_impl(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default7]:[rank7]: return forward_call(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default7]:[rank7]: sharded_logits = self.model( |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default7]:[rank7]: return self._call_impl(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default7]:[rank7]: return forward_call(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default7]:[rank7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default7]:[rank7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default7]:[rank7]: return self._call_impl(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default7]:[rank7]: return forward_call(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default7]:[rank7]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default7]:[rank7]: pipeline_state.run_communication() |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default7]:[rank7]: recv_activation_tensor = recv_activation() |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default7]:[rank7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default7]:[rank7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default7]:[rank7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default7]:[rank7]: dist.recv( |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default7]:[rank7]: return func(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default7]:[rank7]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default7]:[rank7]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer |
|
[default7]:[rank7]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default7]:[rank7]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f24ccf58897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default7]:[rank7]: frame #1: <unknown function> + 0x5b3a23e (0x7f2506a7523e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f2506a6fc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f2506a6ff82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f2506a70fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2506a25371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2506a25371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2506a25371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2506a25371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f24ce232189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default7]:[rank7]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f24ce239610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default7]:[rank7]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f24ce258978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default7]:[rank7]: frame #12: <unknown function> + 0x5adc309 (0x7f2506a17309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #13: <unknown function> + 0x5ae6f10 (0x7f2506a21f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f2506a21fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #15: <unknown function> + 0x5124446 (0x7f250605f446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #16: <unknown function> + 0x1acf4b8 (0x7f2502a0a4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #17: <unknown function> + 0x5aee004 (0x7f2506a29004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #18: <unknown function> + 0x5af36b5 (0x7f2506a2e6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #19: <unknown function> + 0xd2631e (0x7f251961831e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default7]:[rank7]: frame #20: <unknown function> + 0x47def4 (0x7f2518d6fef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default7]:[rank7]: frame #21: <unknown function> + 0x1445a6 (0x5588df5965a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5588df58fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #23: <unknown function> + 0x150866 (0x5588df5a2866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5588df58b142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5588df596a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #26: PyObject_Call + 0xbc (0x5588df5a2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5588df5892b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5588df596a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5588df5878fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #30: <unknown function> + 0x150582 (0x5588df5a2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5588df5878fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #32: <unknown function> + 0x150582 (0x5588df5a2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5588df5878fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #34: <unknown function> + 0x150582 (0x5588df5a2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5588df5878fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5588df58ef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5588df5a0c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #38: <unknown function> + 0x211239 (0x5588df663239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5588df58fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5588df58b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5588df596a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5588df586c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5588df596a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5588df5878fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #45: <unknown function> + 0x150582 (0x5588df5a2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #46: PyObject_Call + 0xbc (0x5588df5a2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5588df5892b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #48: <unknown function> + 0x150582 (0x5588df5a2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #49: PyObject_Call + 0xbc (0x5588df5a2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5588df5892b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5588df596a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5588df58f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5588df5a0c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #54: <unknown function> + 0x211239 (0x5588df663239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #55: PyObject_Call + 0x207 (0x5588df5a3067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5588df5892b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #57: <unknown function> + 0x150582 (0x5588df5a2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5588df5878fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #59: <unknown function> + 0x150582 (0x5588df5a2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #60: PyObject_Call + 0xbc (0x5588df5a2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5588df5892b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #62: <unknown function> + 0x150582 (0x5588df5a2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #63: PyObject_Call + 0xbc (0x5588df5a2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default4]:[rank4]: Traceback (most recent call last): |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default4]:[rank4]: trainer.train(dataloader) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default4]:[rank4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default4]:[rank4]: output = model(**micro_batch) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default4]:[rank4]: return self._call_impl(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default4]:[rank4]: return forward_call(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default4]:[rank4]: sharded_logits = self.model( |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default4]:[rank4]: return self._call_impl(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default4]:[rank4]: return forward_call(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default4]:[rank4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default4]:[rank4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default4]:[rank4]: return self._call_impl(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default4]:[rank4]: return forward_call(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default4]:[rank4]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default4]:[rank4]: pipeline_state.run_communication() |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default4]:[rank4]: recv_activation_tensor = recv_activation() |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default4]:[rank4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default4]:[rank4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default4]:[rank4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default4]:[rank4]: dist.recv( |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default4]:[rank4]: return func(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default4]:[rank4]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default4]:[rank4]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer |
|
[default4]:[rank4]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default4]:[rank4]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc5d1eb4897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default4]:[rank4]: frame #1: <unknown function> + 0x5b3a23e (0x7fc60b9d123e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7fc60b9cbc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fc60b9cbf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fc60b9ccfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc60b981371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc60b981371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc60b981371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc60b981371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fc5d318e189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default4]:[rank4]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fc5d3195610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default4]:[rank4]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7fc5d31b4978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default4]:[rank4]: frame #12: <unknown function> + 0x5adc309 (0x7fc60b973309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #13: <unknown function> + 0x5ae6f10 (0x7fc60b97df10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #14: <unknown function> + 0x5ae6fa5 (0x7fc60b97dfa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #15: <unknown function> + 0x5124446 (0x7fc60afbb446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #16: <unknown function> + 0x1acf4b8 (0x7fc6079664b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #17: <unknown function> + 0x5aee004 (0x7fc60b985004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #18: <unknown function> + 0x5af36b5 (0x7fc60b98a6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #19: <unknown function> + 0xd2631e (0x7fc61e57431e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default4]:[rank4]: frame #20: <unknown function> + 0x47def4 (0x7fc61dccbef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default4]:[rank4]: frame #21: <unknown function> + 0x1445a6 (0x55bca24f05a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55bca24e9a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #23: <unknown function> + 0x150866 (0x55bca24fc866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55bca24e5142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55bca24f0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #26: PyObject_Call + 0xbc (0x55bca24fcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55bca24e32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55bca24f0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55bca24e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #30: <unknown function> + 0x150582 (0x55bca24fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55bca24e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #32: <unknown function> + 0x150582 (0x55bca24fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55bca24e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #34: <unknown function> + 0x150582 (0x55bca24fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55bca24e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55bca24e8f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55bca24fac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #38: <unknown function> + 0x211239 (0x55bca25bd239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55bca24e9a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55bca24e53e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55bca24f0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55bca24e0c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55bca24f0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55bca24e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #45: <unknown function> + 0x150582 (0x55bca24fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #46: PyObject_Call + 0xbc (0x55bca24fcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55bca24e32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #48: <unknown function> + 0x150582 (0x55bca24fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #49: PyObject_Call + 0xbc (0x55bca24fcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55bca24e32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55bca24f0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55bca24e9007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55bca24fac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #54: <unknown function> + 0x211239 (0x55bca25bd239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #55: PyObject_Call + 0x207 (0x55bca24fd067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55bca24e32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #57: <unknown function> + 0x150582 (0x55bca24fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55bca24e18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #59: <unknown function> + 0x150582 (0x55bca24fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #60: PyObject_Call + 0xbc (0x55bca24fcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55bca24e32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #62: <unknown function> + 0x150582 (0x55bca24fc582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #63: PyObject_Call + 0xbc (0x55bca24fcf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default6]:[rank6]: Traceback (most recent call last): |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default6]:[rank6]: trainer.train(dataloader) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default6]:[rank6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default6]:[rank6]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter |
|
[default6]:[rank6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default6]:[rank6]: output = model(**micro_batch) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default6]:[rank6]: return self._call_impl(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default6]:[rank6]: return forward_call(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default6]:[rank6]: sharded_logits = self.model( |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default6]:[rank6]: return self._call_impl(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default6]:[rank6]: return forward_call(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default6]:[rank6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default6]:[rank6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default6]:[rank6]: return self._call_impl(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default6]:[rank6]: return forward_call(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default6]:[rank6]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default6]:[rank6]: pipeline_state.run_communication() |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default6]:[rank6]: recv_activation_tensor = recv_activation() |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default6]:[rank6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default6]:[rank6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default6]:[rank6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default6]:[rank6]: dist.recv( |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default6]:[rank6]: return func(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default6]:[rank6]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default6]:[rank6]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer |
|
[default6]:[rank6]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default6]:[rank6]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4364845897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default6]:[rank6]: frame #1: <unknown function> + 0x5b3a23e (0x7f439e36223e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f439e35cc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f439e35cf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f439e35dfd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f439e312371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f439e312371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f439e312371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f439e312371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f4365b1f189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default6]:[rank6]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f4365b26610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default6]:[rank6]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f4365b45978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default6]:[rank6]: frame #12: <unknown function> + 0x5adc309 (0x7f439e304309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #13: <unknown function> + 0x5ae6f10 (0x7f439e30ef10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f439e30efa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #15: <unknown function> + 0x5124446 (0x7f439d94c446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #16: <unknown function> + 0x1acf4b8 (0x7f439a2f74b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #17: <unknown function> + 0x5aee004 (0x7f439e316004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #18: <unknown function> + 0x5af36b5 (0x7f439e31b6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #19: <unknown function> + 0xd2631e (0x7f43b0f0531e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default6]:[rank6]: frame #20: <unknown function> + 0x47def4 (0x7f43b065cef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default6]:[rank6]: frame #21: <unknown function> + 0x1445a6 (0x55b1f06785a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55b1f0671a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #23: <unknown function> + 0x150866 (0x55b1f0684866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b1f066d142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55b1f0678a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #26: PyObject_Call + 0xbc (0x55b1f0684f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b1f066b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55b1f0678a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b1f06698fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #30: <unknown function> + 0x150582 (0x55b1f0684582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b1f06698fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #32: <unknown function> + 0x150582 (0x55b1f0684582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b1f06698fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #34: <unknown function> + 0x150582 (0x55b1f0684582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b1f06698fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b1f0670f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55b1f0682c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #38: <unknown function> + 0x211239 (0x55b1f0745239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55b1f0671a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b1f066d3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55b1f0678a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b1f0668c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55b1f0678a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b1f06698fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #45: <unknown function> + 0x150582 (0x55b1f0684582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #46: PyObject_Call + 0xbc (0x55b1f0684f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b1f066b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #48: <unknown function> + 0x150582 (0x55b1f0684582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #49: PyObject_Call + 0xbc (0x55b1f0684f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b1f066b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55b1f0678a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b1f0671007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55b1f0682c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #54: <unknown function> + 0x211239 (0x55b1f0745239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #55: PyObject_Call + 0x207 (0x55b1f0685067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b1f066b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #57: <unknown function> + 0x150582 (0x55b1f0684582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b1f06698fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #59: <unknown function> + 0x150582 (0x55b1f0684582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #60: PyObject_Call + 0xbc (0x55b1f0684f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b1f066b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #62: <unknown function> + 0x150582 (0x55b1f0684582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #63: PyObject_Call + 0xbc (0x55b1f0684f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
W0703 22:46:43.189000 139809641432896 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1029972 closing signal SIGTERM |
|
W0703 22:46:43.189000 139809641432896 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1029973 closing signal SIGTERM |
|
W0703 22:46:43.190000 139809641432896 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1029974 closing signal SIGTERM |
|
W0703 22:46:43.190000 139809641432896 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1029975 closing signal SIGTERM |
|
W0703 22:46:43.191000 139809641432896 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1029976 closing signal SIGTERM |
|
W0703 22:46:43.191000 139809641432896 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1029977 closing signal SIGTERM |
|
W0703 22:46:43.191000 139809641432896 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1029978 closing signal SIGTERM |
|
E0703 22:46:45.012000 139809641432896 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1029971) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 |
|
Traceback (most recent call last): |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module> |
|
sys.exit(main()) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper |
|
return f(*args, **kwargs) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main |
|
run(args) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run |
|
elastic_launch( |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
|
return launch_agent(self._config, self._entrypoint, list(args)) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent |
|
raise ChildFailedError( |
|
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
============================================================ |
|
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED |
|
------------------------------------------------------------ |
|
Failures: |
|
<NO_OTHER_FAILURES> |
|
------------------------------------------------------------ |
|
Root Cause (first observed failure): |
|
[0]: |
|
time : 2024-07-03_22:46:43 |
|
host : ip-26-0-161-178.ec2.internal |
|
rank : 0 (local_rank: 0) |
|
exitcode : 1 (pid: 1029971) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
============================================================ |
|
srun: error: ip-26-0-161-178: task 0: Exited with exit code 1 |
|
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details. |
|
|