|
======================== |
|
START TIME: Wed Jul 3 22:53:13 UTC 2024 |
|
python3 version = Python 3.10.14 |
|
======================== |
|
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well. |
|
Token is valid (permission: write). |
|
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token |
|
Login successful |
|
Already on 'bench_cluster' |
|
M examples/config_tiny_llama.py |
|
M examples/config_tiny_llama.yaml |
|
M examples/train_tiny_llama.sh |
|
M src/nanotron/models/llama.py |
|
M src/nanotron/trainer.py |
|
Your branch is up to date with 'origin/bench_cluster'. |
|
Job status: RUNNING |
|
W0703 22:53:16.374000 139799845013312 torch/distributed/run.py:757] |
|
W0703 22:53:16.374000 139799845013312 torch/distributed/run.py:757] ***************************************** |
|
W0703 22:53:16.374000 139799845013312 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
W0703 22:53:16.374000 139799845013312 torch/distributed/run.py:757] ***************************************** |
|
[default0]:07/03/2024 22:53:32 [WARNING|DP=0|PP=0|TP=0|ip-26-0-161-178]: [Vocab Size Padding] Padded vocab (size: 50257) with 1 dummy tokens (new size: 50258) |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Config: |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Config(general=GeneralArgs(project='bench_cluster', |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: run='%date_%jobid', |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: seed=42, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: step=None, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: consumed_train_samples=None, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: benchmark_csv_path=None, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: ignore_sanity_checks=True), |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: parallelism=ParallelismArgs(dp=1, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pp=4, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tp=2, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pp_engine=<nanotron.parallel.pipeline_parallel.engine.OneForwardOneBackwardPipelineEngine object at 0x7fdab79a8670>, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tp_mode=<TensorParallelLinearMode.REDUCE_SCATTER: 2>, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tp_linear_async_communication=False, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: expert_parallel_size=1), |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: eos_token_id=2, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hidden_act='silu', |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hidden_size=2048, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: initializer_range=0.02, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: intermediate_size=4096, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: is_llama_config=True, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: max_position_embeddings=4096, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_attention_heads=32, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_hidden_layers=24, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_key_value_heads=32, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pad_token_id=None, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pretraining_tp=1, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rms_norm_eps=1e-05, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rope_scaling=None, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rope_theta=10000.0, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tie_word_embeddings=True, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: use_cache=True, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: vocab_size=50258), |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: init_method=RandomInit(std=0.025), |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: dtype=torch.bfloat16, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: make_vocab_size_divisible_by=1, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: ddp_bucket_cap_mb=25), |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2', |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tokenizer_revision=None, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tokenizer_max_length=None), |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'), |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: checkpoint_interval=100000, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: save_initial_state=False, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: resume_checkpoint_path=None, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: checkpoints_path_is_shared_file_system=False), |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: logging=LoggingArgs(log_level='info', |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: log_level_replica='info', |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: iteration_step_info_interval=1), |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tokens=TokensArgs(sequence_length=4096, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: train_steps=20, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: micro_batch_size=64, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: batch_accumulation_per_replica=16, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: val_check_interval=-1, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: limit_val_batches=0, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: limit_test_batches=0), |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: adam_beta1=0.9, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: adam_beta2=0.95, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: torch_adam_is_fused=True, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: name='adamW'), |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: zero_stage=1, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: weight_decay=0.01, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: clip_grad=1.0, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: accumulate_grad_in_fp32=True, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lr_warmup_steps=1, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lr_warmup_style='linear', |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lr_decay_style='linear', |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lr_decay_steps=19, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lr_decay_starting_step=None, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: min_decay_lr=1e-05)), |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: data_stages=[DatasetStageArgs(name='Training Stage', |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: start_training_step=1, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories', |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hf_dataset_splits='train', |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hf_dataset_config_name=None, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: dataset_processing_num_proc_per_process=64, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: dataset_overwrite_cache=False, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: text_column_name='text'), |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: seed=42, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_loading_workers=0))], |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/8_GPUS/dp-1_tp-2_pp-4_mbz-64')), |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: lighteval=None) |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Model Config: |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: LlamaConfig(bos_token_id=1, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: eos_token_id=2, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hidden_act='silu', |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: hidden_size=2048, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: initializer_range=0.02, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: intermediate_size=4096, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: is_llama_config=True, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: max_position_embeddings=4096, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_attention_heads=32, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_hidden_layers=24, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: num_key_value_heads=32, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pad_token_id=None, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: pretraining_tp=1, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rms_norm_eps=1e-05, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rope_scaling=None, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: rope_theta=10000.0, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: tie_word_embeddings=True, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: use_cache=True, |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: vocab_size=50258) |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Building model.. |
|
[default0]:07/03/2024 22:53:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Setting PP block ranks... |
|
[default0]:07/03/2024 22:53:46 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Total number of parameters: 1.21G (2313.02MiB) |
|
[default0]:07/03/2024 22:53:46 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Local number of parameters: 198M (378.21MiB) |
|
[default0]:07/03/2024 22:53:46 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [After model building] Memory usage: 385.23MiB. Peak allocated: 387.26MiB Peak reserved: 402.00MiB |
|
[default2]:07/03/2024 22:53:46 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-178]: Local number of parameters: 147M (280.05MiB) |
|
[default2]:07/03/2024 22:53:46 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-178]: [After model building] Memory usage: 287.07MiB. Peak allocated: 289.10MiB Peak reserved: 302.00MiB |
|
[default0]:07/03/2024 22:53:46 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: No checkpoint path provided. |
|
[default5]:07/03/2024 22:53:46 [INFO|DP=0|PP=2|TP=1|ip-26-0-161-178]: Local number of parameters: 126M (240.05MiB) |
|
[default5]:07/03/2024 22:53:46 [INFO|DP=0|PP=2|TP=1|ip-26-0-161-178]: [After model building] Memory usage: 246.06MiB. Peak allocated: 248.09MiB Peak reserved: 262.00MiB |
|
[default1]:07/03/2024 22:53:46 [INFO|DP=0|PP=0|TP=1|ip-26-0-161-178]: Local number of parameters: 198M (378.21MiB) |
|
[default1]:07/03/2024 22:53:46 [INFO|DP=0|PP=0|TP=1|ip-26-0-161-178]: [After model building] Memory usage: 385.23MiB. Peak allocated: 387.26MiB Peak reserved: 402.00MiB |
|
[default0]:07/03/2024 22:53:46 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Parametrizing model parameters using StandardParametrizator |
|
[default6]:07/03/2024 22:53:46 [INFO|DP=0|PP=3|TP=0|ip-26-0-161-178]: Local number of parameters: 135M (258.20MiB) |
|
[default6]:07/03/2024 22:53:46 [INFO|DP=0|PP=3|TP=0|ip-26-0-161-178]: [After model building] Memory usage: 262.21MiB. Peak allocated: 264.24MiB Peak reserved: 280.00MiB |
|
[default5]:07/03/2024 22:53:46 [INFO|DP=0|PP=2|TP=1|ip-26-0-161-178]: No checkpoint path provided. |
|
[default2]:07/03/2024 22:53:46 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-178]: No checkpoint path provided. |
|
[default6]:07/03/2024 22:53:46 [INFO|DP=0|PP=3|TP=0|ip-26-0-161-178]: No checkpoint path provided. |
|
[default1]:07/03/2024 22:53:46 [INFO|DP=0|PP=0|TP=1|ip-26-0-161-178]: No checkpoint path provided. |
|
[default4]:07/03/2024 22:53:46 [INFO|DP=0|PP=2|TP=0|ip-26-0-161-178]: Local number of parameters: 126M (240.05MiB) |
|
[default4]:07/03/2024 22:53:46 [INFO|DP=0|PP=2|TP=0|ip-26-0-161-178]: [After model building] Memory usage: 246.06MiB. Peak allocated: 248.09MiB Peak reserved: 262.00MiB |
|
[default4]:07/03/2024 22:53:46 [INFO|DP=0|PP=2|TP=0|ip-26-0-161-178]: No checkpoint path provided. |
|
[default3]:07/03/2024 22:53:46 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-178]: Local number of parameters: 147M (280.05MiB) |
|
[default3]:07/03/2024 22:53:46 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-178]: [After model building] Memory usage: 287.07MiB. Peak allocated: 289.10MiB Peak reserved: 302.00MiB |
|
[default3]:07/03/2024 22:53:46 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-178]: No checkpoint path provided. |
|
[default7]:07/03/2024 22:53:46 [INFO|DP=0|PP=3|TP=1|ip-26-0-161-178]: Local number of parameters: 135M (258.20MiB) |
|
[default7]:07/03/2024 22:53:46 [INFO|DP=0|PP=3|TP=1|ip-26-0-161-178]: [After model building] Memory usage: 262.21MiB. Peak allocated: 264.24MiB Peak reserved: 280.00MiB |
|
[default7]:07/03/2024 22:53:46 [INFO|DP=0|PP=3|TP=1|ip-26-0-161-178]: No checkpoint path provided. |
|
[default0]:07/03/2024 22:53:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [Optimizer Building] Using LearningRateForSP as learning rate |
|
[default0]:07/03/2024 22:53:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [ZeRO sharding] Size of optimizer params per rank: |
|
[default0]:07/03/2024 22:53:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [ZeRO sharding] DP Rank 0 has 198M out of 198M (100.00%) params' optimizer states |
|
[default0]:07/03/2024 22:53:49 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples |
|
[default0]:07/03/2024 22:53:49 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Using `datasets` library |
|
[default0]:07/03/2024 22:53:49 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4') |
|
[default0]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default0]:07/03/2024 22:53:49 [WARNING|DP=0|PP=0|TP=0|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default0]:07/03/2024 22:53:50 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [Training Plan] There are 1 training stages |
|
[default0]:07/03/2024 22:53:50 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [Stage Training Stage] start from step 1 |
|
[default0]:07/03/2024 22:53:50 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: |
|
[default0]:07/03/2024 22:53:50 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: [Start training] datetime: 2024-07-03 22:53:50.184509 | mbs: 64 | grad_accum: 16 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0 |
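Note: the `global_batch_size: 1024` reported on the line above is consistent with the logged `dp=1`, `micro_batch_size=64` and `batch_accumulation_per_replica=16`. The sketch below redoes that arithmetic. It assumes the usual relation global batch = dp × micro-batch × accumulation steps, which matches the logged value but is not taken from the nanotron source. The variable names are illustrative.

```python
# Values copied from the logged config.
dp = 1                    # parallelism.dp
micro_batch_size = 64     # tokens.micro_batch_size
grad_accum = 16           # tokens.batch_accumulation_per_replica
sequence_length = 4096    # tokens.sequence_length

# Assumed relation (matches the logged global_batch_size of 1024).
global_batch_size = dp * micro_batch_size * grad_accum

# Tokens processed per optimizer step under that assumption.
tokens_per_step = global_batch_size * sequence_length

print(global_batch_size, tokens_per_step)
```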
|
[default0]:07/03/2024 22:53:50 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps |
|
[default0]:07/03/2024 22:53:50 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-178]: Memory usage: 1898.09MiB. Peak allocated 1898.09MiB. Peak reserved: 1918.00MiB |
|
[default6]:07/03/2024 22:53:50 [WARNING|DP=0|PP=3|TP=0|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default1]:07/03/2024 22:53:50 [WARNING|DP=0|PP=0|TP=1|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default6]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default1]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default3]:07/03/2024 22:53:50 [WARNING|DP=0|PP=1|TP=1|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default3]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default7]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default7]:07/03/2024 22:53:50 [WARNING|DP=0|PP=3|TP=1|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default5]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default5]:07/03/2024 22:53:50 [WARNING|DP=0|PP=2|TP=1|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default4]:07/03/2024 22:53:50 [WARNING|DP=0|PP=2|TP=0|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default4]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default2]:07/03/2024 22:53:50 [WARNING|DP=0|PP=1|TP=0|ip-26-0-161-178]: Repo card metadata block was not found. Setting CardData to empty. |
|
[default2]:Repo card metadata block was not found. Setting CardData to empty. |
|
[default0]:[rank0]: Traceback (most recent call last): |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default0]:[rank0]: trainer.train(dataloader) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default0]:[rank0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default0]:[rank0]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default0]:[rank0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default0]:[rank0]: output = model(**micro_batch) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank0]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank0]: return forward_call(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default0]:[rank0]: sharded_logits = self.model( |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank0]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank0]: return forward_call(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default0]:[rank0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default0]:[rank0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank0]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank0]: return forward_call(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward |
|
[default0]:[rank0]: output = self.pp_block(**new_kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank0]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank0]: return forward_call(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward |
|
[default0]:[rank0]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank0]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank0]: return forward_call(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward |
|
[default0]:[rank0]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default0]:[rank0]: return self._call_impl(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default0]:[rank0]: return forward_call(*args, **kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 128, in forward |
|
[default0]:[rank0]: return self.act(gate_states) * up_states |
|
[default0]:[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU |
|
[default1]:[rank1]: Traceback (most recent call last): |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default1]:[rank1]: trainer.train(dataloader) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default1]:[rank1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default1]:[rank1]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default1]:[rank1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default1]:[rank1]: output = model(**micro_batch) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank1]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank1]: return forward_call(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default1]:[rank1]: sharded_logits = self.model( |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank1]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank1]: return forward_call(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default1]:[rank1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default1]:[rank1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank1]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank1]: return forward_call(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward |
|
[default1]:[rank1]: output = self.pp_block(**new_kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank1]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank1]: return forward_call(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward |
|
[default1]:[rank1]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank1]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank1]: return forward_call(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward |
|
[default1]:[rank1]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default1]:[rank1]: return self._call_impl(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default1]:[rank1]: return forward_call(*args, **kwargs) |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 128, in forward |
|
[default1]:[rank1]: return self.act(gate_states) * up_states |
|
[default1]:[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU has a total capacity of 79.33 GiB of which 789.94 MiB is free. Including non-PyTorch memory, this process has 78.55 GiB memory in use. Of the allocated memory 64.99 GiB is allocated by PyTorch, and 1.42 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) |
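Note: the failed 1024.00 MiB allocation is consistent with the size of a single activation at the failing line, `self.act(gate_states) * up_states`. The sketch below is a rough check, not a statement about nanotron internals. It assumes the product has `sequence_length × micro_batch_size × (intermediate_size / tp)` elements in `bfloat16`, meaning the full sequence is present inside the MLP and the intermediate dimension is split across the 2 tensor-parallel ranks. All values come from the logged config, and the function name is hypothetical.

```python
def activation_mib(seq_len: int, micro_batch: int, intermediate: int,
                   tp: int, bytes_per_element: int) -> float:
    """Size in MiB of one [seq_len, micro_batch, intermediate / tp] tensor."""
    elements = seq_len * micro_batch * (intermediate // tp)
    return elements * bytes_per_element / (1024 ** 2)

# Values copied from the logged config; bfloat16 uses 2 bytes per element.
size_mib = activation_mib(
    seq_len=4096,        # tokens.sequence_length
    micro_batch=64,      # tokens.micro_batch_size
    intermediate=4096,   # model_config.intermediate_size
    tp=2,                # parallelism.tp
    bytes_per_element=2,
)
print(f"{size_mib:.2f} MiB")
```

Under these assumptions the size scales linearly with `micro_batch_size`, so halving it to 32 would halve this tensor to 512 MiB. Whether that is enough to avoid the error depends on the other activations held by the pipeline stage.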
|
[default2]:[rank2]: Traceback (most recent call last): |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default2]:[rank2]: trainer.train(dataloader) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default2]:[rank2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default2]:[rank2]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default2]:[rank2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default6]:[rank6]: Traceback (most recent call last): |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default6]:[rank6]: trainer.train(dataloader) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default2]:[rank2]: output = model(**micro_batch) |
|
[default6]:[rank6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default6]:[rank6]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default2]:[rank2]: return self._call_impl(*args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default2]:[rank2]: return forward_call(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter |
|
[default6]:[rank6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default6]:[rank6]: output = model(**micro_batch) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default2]:[rank2]: sharded_logits = self.model( |
|
[default6]:[rank6]: return self._call_impl(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default2]:[rank2]: return self._call_impl(*args, **kwargs) |
|
[default6]:[rank6]: return forward_call(*args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default2]:[rank2]: return forward_call(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default6]:[rank6]: sharded_logits = self.model( |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default6]:[rank6]: return self._call_impl(*args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default2]:[rank2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default2]:[rank2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default2]:[rank2]: return self._call_impl(*args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default2]:[rank2]: return forward_call(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default2]:[rank2]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default2]:[rank2]: pipeline_state.run_communication() |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default6]:[rank6]: return forward_call(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default2]:[rank2]: recv_activation_tensor = recv_activation() |
|
[default6]:[rank6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default2]:[rank2]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default2]:[rank2]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default6]:[rank6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default2]:[rank2]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default6]:[rank6]: return self._call_impl(*args, **kwargs) |
|
[default2]:[rank2]: dist.recv( |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default6]:[rank6]: return forward_call(*args, **kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default6]:[rank6]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default6]:[rank6]: pipeline_state.run_communication() |
|
[default2]:[rank2]: return func(*args, **kwargs) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default2]:[rank2]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default2]:[rank2]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer |
|
[default6]:[rank6]: recv_activation_tensor = recv_activation() |
|
[default2]:[rank2]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default2]:[rank2]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2a343cf897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default6]:[rank6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default2]:[rank2]: frame #1: <unknown function> + 0x5b3a23e (0x7f2a6deec23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank2]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f2a6dee6c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default2]:[rank2]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f2a6dee6f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default6]:[rank6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default6]:[rank6]: dist.recv( |
|
[default2]:[rank2]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f2a6dee7fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank2]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2a6de9c371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default2]:[rank2]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2a6de9c371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank2]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2a6de9c371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: return func(*args, **kwargs) |
|
[default2]:[rank2]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2a6de9c371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank2]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f2a356a9189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default2]:[rank2]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f2a356b0610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default6]:[rank6]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default6]:[rank6]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer |
|
[default6]:[rank6]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default6]:[rank6]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe546fc6897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default6]:[rank6]: frame #1: <unknown function> + 0x5b3a23e (0x7fe580ae323e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7fe580addc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fe580addf82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank2]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f2a356cf978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default6]:[rank6]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fe580adefd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe580a93371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank2]: frame #12: <unknown function> + 0x5adc309 (0x7f2a6de8e309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank2]: frame #13: <unknown function> + 0x5ae6f10 (0x7f2a6de98f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe580a93371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe580a93371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe580a93371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fe5482a0189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default6]:[rank6]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7fe5482a7610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default2]:[rank2]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f2a6de98fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7fe5482c6978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default2]:[rank2]: frame #15: <unknown function> + 0x5124446 (0x7f2a6d4d6446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank2]: frame #16: <unknown function> + 0x1acf4b8 (0x7f2a69e814b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank2]: frame #17: <unknown function> + 0x5aee004 (0x7f2a6dea0004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank2]: frame #18: <unknown function> + 0x5af36b5 (0x7f2a6dea56b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #12: <unknown function> + 0x5adc309 (0x7fe580a85309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank2]: frame #19: <unknown function> + 0xd2631e (0x7f2a80a8f31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default6]:[rank6]: frame #13: <unknown function> + 0x5ae6f10 (0x7fe580a8ff10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank2]: frame #20: <unknown function> + 0x47def4 (0x7f2a801e6ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default2]:[rank2]: frame #21: <unknown function> + 0x1445a6 (0x5644975925a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #22: _PyObject_MakeTpCall + 0x26b (0x56449758ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #23: <unknown function> + 0x150866 (0x56449759e866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x564497587142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #25: _PyFunction_Vectorcall + 0x6c (0x564497592a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #14: <unknown function> + 0x5ae6fa5 (0x7fe580a8ffa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #15: <unknown function> + 0x5124446 (0x7fe5800cd446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #16: <unknown function> + 0x1acf4b8 (0x7fe57ca784b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank2]: frame #26: PyObject_Call + 0xbc (0x56449759ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #17: <unknown function> + 0x5aee004 (0x7fe580a97004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default2]:[rank2]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5644975852b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #28: _PyFunction_Vectorcall + 0x6c (0x564497592a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5644975838fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #30: <unknown function> + 0x150582 (0x56449759e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5644975838fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #32: <unknown function> + 0x150582 (0x56449759e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #18: <unknown function> + 0x5af36b5 (0x7fe580a9c6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default6]:[rank6]: frame #19: <unknown function> + 0xd2631e (0x7fe59368631e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default6]:[rank6]: frame #20: <unknown function> + 0x47def4 (0x7fe592dddef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default2]:[rank2]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5644975838fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #34: <unknown function> + 0x150582 (0x56449759e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #21: <unknown function> + 0x1445a6 (0x557de2e595a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5644975838fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56449758af50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #37: _PyObject_Call_Prepend + 0x69 (0x56449759cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #38: <unknown function> + 0x211239 (0x56449765f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #22: _PyObject_MakeTpCall + 0x26b (0x557de2e52a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #23: <unknown function> + 0x150866 (0x557de2e65866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x557de2e4e142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #39: _PyObject_MakeTpCall + 0x26b (0x56449758ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5644975873e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #41: _PyFunction_Vectorcall + 0x6c (0x564497592a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x564497582c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #43: _PyFunction_Vectorcall + 0x6c (0x564497592a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5644975838fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #25: _PyFunction_Vectorcall + 0x6c (0x557de2e59a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #45: <unknown function> + 0x150582 (0x56449759e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #26: PyObject_Call + 0xbc (0x557de2e65f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x557de2e4c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #46: PyObject_Call + 0xbc (0x56449759ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5644975852b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #48: <unknown function> + 0x150582 (0x56449759e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #49: PyObject_Call + 0xbc (0x56449759ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5644975852b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #28: _PyFunction_Vectorcall + 0x6c (0x557de2e59a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x557de2e4a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #30: <unknown function> + 0x150582 (0x557de2e65582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x557de2e4a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #51: _PyFunction_Vectorcall + 0x6c (0x564497592a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #32: <unknown function> + 0x150582 (0x557de2e65582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x557de2e4a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #34: <unknown function> + 0x150582 (0x557de2e65582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x557de2e4a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56449758b007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x557de2e51f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #53: _PyObject_Call_Prepend + 0x69 (0x56449759cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #37: _PyObject_Call_Prepend + 0x69 (0x557de2e63c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #54: <unknown function> + 0x211239 (0x56449765f239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #55: PyObject_Call + 0x207 (0x56449759f067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #38: <unknown function> + 0x211239 (0x557de2f26239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5644975852b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #57: <unknown function> + 0x150582 (0x56449759e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #39: _PyObject_MakeTpCall + 0x26b (0x557de2e52a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5644975838fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #59: <unknown function> + 0x150582 (0x56449759e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #60: PyObject_Call + 0xbc (0x56449759ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5644975852b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x557de2e4e3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #62: <unknown function> + 0x150582 (0x56449759e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #41: _PyFunction_Vectorcall + 0x6c (0x557de2e59a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: frame #63: PyObject_Call + 0xbc (0x56449759ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x557de2e49c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #43: _PyFunction_Vectorcall + 0x6c (0x557de2e59a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default2]:[rank2]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default6]:[rank6]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x557de2e4a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #45: <unknown function> + 0x150582 (0x557de2e65582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #46: PyObject_Call + 0xbc (0x557de2e65f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x557de2e4c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #48: <unknown function> + 0x150582 (0x557de2e65582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #49: PyObject_Call + 0xbc (0x557de2e65f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x557de2e4c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #51: _PyFunction_Vectorcall + 0x6c (0x557de2e59a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x557de2e52007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #53: _PyObject_Call_Prepend + 0x69 (0x557de2e63c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #54: <unknown function> + 0x211239 (0x557de2f26239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #55: PyObject_Call + 0x207 (0x557de2e66067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x557de2e4c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #57: <unknown function> + 0x150582 (0x557de2e65582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x557de2e4a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #59: <unknown function> + 0x150582 (0x557de2e65582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #60: PyObject_Call + 0xbc (0x557de2e65f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x557de2e4c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #62: <unknown function> + 0x150582 (0x557de2e65582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: frame #63: PyObject_Call + 0xbc (0x557de2e65f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default6]:[rank6]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default7]:[rank7]: Traceback (most recent call last): |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default7]:[rank7]: trainer.train(dataloader) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default7]:[rank7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default7]:[rank7]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default5]:[rank5]: Traceback (most recent call last): |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 278, in train_batch_iter |
|
[default7]:[rank7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default5]:[rank5]: trainer.train(dataloader) |
|
[default3]:[rank3]: Traceback (most recent call last): |
|
[default7]:[rank7]: output = model(**micro_batch) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default7]:[rank7]: return self._call_impl(*args, **kwargs) |
|
[default4]:[rank4]: Traceback (most recent call last): |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default7]:[rank7]: return forward_call(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module> |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default4]:[rank4]: trainer.train(dataloader) |
|
[default3]:[rank3]: trainer.train(dataloader) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train |
|
[default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default7]:[rank7]: sharded_logits = self.model( |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default5]:[rank5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default3]:[rank3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) |
|
[default7]:[rank7]: return self._call_impl(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default3]:[rank3]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step |
|
[default5]:[rank5]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter( |
|
[default7]:[rank7]: return forward_call(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default3]:[rank3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default3]:[rank3]: output = model(**micro_batch) |
|
[default7]:[rank7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default5]:[rank5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default7]:[rank7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default5]:[rank5]: output = model(**micro_batch) |
|
[default7]:[rank7]: return self._call_impl(*args, **kwargs) |
|
[default3]:[rank3]: return self._call_impl(*args, **kwargs) |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default3]:[rank3]: return forward_call(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default7]:[rank7]: return forward_call(*args, **kwargs) |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default4]:[rank4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) |
|
[default5]:[rank5]: return self._call_impl(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward |
|
[default3]:[rank3]: sharded_logits = self.model( |
|
[default4]:[rank4]: output = model(**micro_batch) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default7]:[rank7]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default5]:[rank5]: return forward_call(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default3]:[rank3]: return self._call_impl(*args, **kwargs) |
|
[default7]:[rank7]: pipeline_state.run_communication() |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default4]:[rank4]: return self._call_impl(*args, **kwargs) |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default5]:[rank5]: sharded_logits = self.model( |
|
[default3]:[rank3]: return forward_call(*args, **kwargs) |
|
[default7]:[rank7]: recv_activation_tensor = recv_activation() |
|
[default4]:[rank4]: return forward_call(*args, **kwargs) |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default3]:[rank3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default5]:[rank5]: return self._call_impl(*args, **kwargs) |
|
[default7]:[rank7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default4]:[rank4]: sharded_logits = self.model( |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default5]:[rank5]: return forward_call(*args, **kwargs) |
|
[default3]:[rank3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default5]:[rank5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default3]:[rank3]: return self._call_impl(*args, **kwargs) |
|
[default7]:[rank7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default4]:[rank4]: return self._call_impl(*args, **kwargs) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default5]:[rank5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default4]:[rank4]: return forward_call(*args, **kwargs) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default7]:[rank7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default3]:[rank3]: return forward_call(*args, **kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default5]:[rank5]: return self._call_impl(*args, **kwargs) |
|
[default7]:[rank7]: dist.recv( |
|
[default3]:[rank3]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default4]:[rank4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default5]:[rank5]: return forward_call(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default3]:[rank3]: pipeline_state.run_communication() |
|
[default4]:[rank4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
|
[default4]:[rank4]: return self._call_impl(*args, **kwargs) |
|
[default7]:[rank7]: return func(*args, **kwargs) |
|
[default5]:[rank5]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
|
[default3]:[rank3]: recv_activation_tensor = recv_activation() |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default7]:[rank7]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default4]:[rank4]: return forward_call(*args, **kwargs) |
|
[default7]:[rank7]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default3]:[rank3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward |
|
[default7]:[rank7]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default5]:[rank5]: pipeline_state.run_communication() |
|
[default4]:[rank4]: new_kwargs[name] = recv_from_pipeline_state_buffer( |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default7]:[rank7]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8327b36897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer |
|
[default3]:[rank3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default5]:[rank5]: recv_activation_tensor = recv_activation() |
|
[default7]:[rank7]: frame #1: <unknown function> + 0x5b3a23e (0x7f836165323e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: pipeline_state.run_communication() |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default7]:[rank7]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f836164dc87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication |
|
[default3]:[rank3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default7]:[rank7]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f836164df82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default4]:[rank4]: recv_activation_tensor = recv_activation() |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default7]:[rank7]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f836164efd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8361603371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ |
|
[default3]:[rank3]: dist.recv( |
|
[default7]:[rank7]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8361603371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default4]:[rank4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default7]:[rank7]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8361603371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8361603371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: return func(*args, **kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors |
|
[default7]:[rank7]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f8328e10189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default5]:[rank5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default4]:[rank4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default7]:[rank7]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f8328e17610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default7]:[rank7]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f8328e36978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default3]:[rank3]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default3]:[rank3]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default7]:[rank7]: frame #12: <unknown function> + 0x5adc309 (0x7f83615f5309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors |
|
[default3]:[rank3]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default7]:[rank7]: frame #13: <unknown function> + 0x5ae6f10 (0x7f83615fff10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default4]:[rank4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) |
|
[default7]:[rank7]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f83615fffa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9182350897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta |
|
[default3]:[rank3]: frame #1: <unknown function> + 0x5b3a23e (0x7f91bbe6d23e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f91bbe67c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: dist.recv( |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default3]:[rank3]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f91bbe67f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: return func(*args, **kwargs) |
|
[default7]:[rank7]: frame #15: <unknown function> + 0x5124446 (0x7f8360c3d446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f91bbe68fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: dist.recv( |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default7]:[rank7]: frame #16: <unknown function> + 0x1acf4b8 (0x7f835d5e84b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper |
|
[default3]:[rank3]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f91bbe1d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default7]:[rank7]: frame #17: <unknown function> + 0x5aee004 (0x7f8361607004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f91bbe1d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: return func(*args, **kwargs) |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1932, in recv |
|
[default3]:[rank3]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f91bbe1d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: pg.recv([tensor], group_src_rank, tag).wait() |
|
[default4]:[rank4]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer |
|
[default7]:[rank7]: frame #18: <unknown function> + 0x5af36b5 (0x7f836160c6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer |
|
[default4]:[rank4]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default7]:[rank7]: frame #19: <unknown function> + 0xd2631e (0x7f83741f631e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default7]:[rank7]: frame #20: <unknown function> + 0x47def4 (0x7f837394def4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default3]:[rank3]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f91bbe1d371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f918362a189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default5]:[rank5]: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first): |
|
[default4]:[rank4]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6c6d4ec897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default4]:[rank4]: frame #1: <unknown function> + 0x5b3a23e (0x7f6ca700923e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f82fd719897 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) |
|
[default7]:[rank7]: frame #21: <unknown function> + 0x1445a6 (0x55f67f97c5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f6ca7003c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f9183631610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default3]:[rank3]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f9183650978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default4]:[rank4]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f6ca7003f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55f67f975a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #1: <unknown function> + 0x5b3a23e (0x7f833723623e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #23: <unknown function> + 0x150866 (0x55f67f988866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55f67f971142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #12: <unknown function> + 0x5adc309 (0x7f91bbe0f309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #13: <unknown function> + 0x5ae6f10 (0x7f91bbe19f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55f67f97ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f6ca7004fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7f8337230c87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f8337230f82 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6ca6fb9371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #26: PyObject_Call + 0xbc (0x55f67f988f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f8337231fd1 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f91bbe19fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6ca6fb9371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55f67f96f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f83371e6371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #15: <unknown function> + 0x5124446 (0x7f91bb457446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55f67f97ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6ca6fb9371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #16: <unknown function> + 0x1acf4b8 (0x7f91b7e024b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f83371e6371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55f67f96d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6ca6fb9371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #17: <unknown function> + 0x5aee004 (0x7f91bbe21004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f83371e6371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f6c6e7c6189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default4]:[rank4]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f6c6e7cd610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default5]:[rank5]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f83371e6371 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #18: <unknown function> + 0x5af36b5 (0x7f91bbe266b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #30: <unknown function> + 0x150582 (0x55f67f988582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f6c6e7ec978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default5]:[rank5]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f82fe9f3189 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default3]:[rank3]: frame #19: <unknown function> + 0xd2631e (0x7f91cea1031e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default7]:[rank7]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55f67f96d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #12: <unknown function> + 0x5adc309 (0x7f6ca6fab309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #13: <unknown function> + 0x5ae6f10 (0x7f6ca6fb5f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #20: <unknown function> + 0x47def4 (0x7f91ce167ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default7]:[rank7]: frame #32: <unknown function> + 0x150582 (0x55f67f988582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f6ca6fb5fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #21: <unknown function> + 0x1445a6 (0x55824bff85a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55f67f96d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7f82fe9fa610 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default3]:[rank3]: frame #22: _PyObject_MakeTpCall + 0x26b (0x55824bff1a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #15: <unknown function> + 0x5124446 (0x7f6ca65f3446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #23: <unknown function> + 0x150866 (0x55824c004866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #34: <unknown function> + 0x150582 (0x55f67f988582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7f82fea19978 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) |
|
[default3]:[rank3]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55824bfed142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #16: <unknown function> + 0x1acf4b8 (0x7f6ca2f9e4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default7]:[rank7]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55f67f96d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #25: _PyFunction_Vectorcall + 0x6c (0x55824bff8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #17: <unknown function> + 0x5aee004 (0x7f6ca6fbd004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #26: PyObject_Call + 0xbc (0x55824c004f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #12: <unknown function> + 0x5adc309 (0x7f83371d8309 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default4]:[rank4]: frame #18: <unknown function> + 0x5af36b5 (0x7f6ca6fc26b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55824bfeb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55f67f974f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #19: <unknown function> + 0xd2631e (0x7f6cb9bac31e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default5]:[rank5]: frame #13: <unknown function> + 0x5ae6f10 (0x7f83371e2f10 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #28: _PyFunction_Vectorcall + 0x6c (0x55824bff8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55f67f986c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #14: <unknown function> + 0x5ae6fa5 (0x7f83371e2fa5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default5]:[rank5]: frame #15: <unknown function> + 0x5124446 (0x7f8336820446 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55824bfe98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #38: <unknown function> + 0x211239 (0x55f67fa49239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #20: <unknown function> + 0x47def4 (0x7f6cb9303ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default3]:[rank3]: frame #30: <unknown function> + 0x150582 (0x55824c004582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55f67f975a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #21: <unknown function> + 0x1445a6 (0x5566921c85a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #16: <unknown function> + 0x1acf4b8 (0x7f83331cb4b8 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55824bfe98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55f67f9713e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5566921c1a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #17: <unknown function> + 0x5aee004 (0x7f83371ea004 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #32: <unknown function> + 0x150582 (0x55824c004582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55f67f97ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #23: <unknown function> + 0x150866 (0x5566921d4866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #18: <unknown function> + 0x5af36b5 (0x7f83371ef6b5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) |
|
[default3]:[rank3]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55824bfe98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55f67f96cc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5566921bd142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #34: <unknown function> + 0x150582 (0x55824c004582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #19: <unknown function> + 0xd2631e (0x7f8349dd931e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default7]:[rank7]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55f67f97ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5566921c8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55824bfe98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #20: <unknown function> + 0x47def4 (0x7f8349530ef4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) |
|
[default7]:[rank7]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55f67f96d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #26: PyObject_Call + 0xbc (0x5566921d4f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #21: <unknown function> + 0x1445a6 (0x5594ae69b5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55824bff0f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #45: <unknown function> + 0x150582 (0x55f67f988582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5566921bb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #22: _PyObject_MakeTpCall + 0x26b (0x5594ae694a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #37: _PyObject_Call_Prepend + 0x69 (0x55824c002c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #46: PyObject_Call + 0xbc (0x55f67f988f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5566921c8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #38: <unknown function> + 0x211239 (0x55824c0c5239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55f67f96f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5566921b98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #23: <unknown function> + 0x150866 (0x5594ae6a7866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #39: _PyObject_MakeTpCall + 0x26b (0x55824bff1a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5594ae690142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #48: <unknown function> + 0x150582 (0x55f67f988582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #30: <unknown function> + 0x150582 (0x5566921d4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55824bfed3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #25: _PyFunction_Vectorcall + 0x6c (0x5594ae69ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5566921b98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #49: PyObject_Call + 0xbc (0x55f67f988f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #26: PyObject_Call + 0xbc (0x5594ae6a7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #41: _PyFunction_Vectorcall + 0x6c (0x55824bff8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #32: <unknown function> + 0x150582 (0x5566921d4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55f67f96f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5594ae68e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55824bfe8c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5566921b98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55f67f97ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #28: _PyFunction_Vectorcall + 0x6c (0x5594ae69ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #43: _PyFunction_Vectorcall + 0x6c (0x55824bff8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #34: <unknown function> + 0x150582 (0x5566921d4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55f67f975007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55f67f986c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55824bfe98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5566921b98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #54: <unknown function> + 0x211239 (0x55f67fa49239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5566921c0f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5594ae68c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #45: <unknown function> + 0x150582 (0x55824c004582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5566921d2c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #55: PyObject_Call + 0x207 (0x55f67f989067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55f67f96f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #57: <unknown function> + 0x150582 (0x55f67f988582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #38: <unknown function> + 0x211239 (0x556692295239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #46: PyObject_Call + 0xbc (0x55824c004f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55f67f96d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5566921c1a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #30: <unknown function> + 0x150582 (0x5594ae6a7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55824bfeb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5594ae68c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5566921bd3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #48: <unknown function> + 0x150582 (0x55824c004582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #49: PyObject_Call + 0xbc (0x55824c004f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5566921c8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #32: <unknown function> + 0x150582 (0x5594ae6a7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #59: <unknown function> + 0x150582 (0x55f67f988582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5566921b8c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5594ae68c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55824bfeb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #60: PyObject_Call + 0xbc (0x55f67f988f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #51: _PyFunction_Vectorcall + 0x6c (0x55824bff8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5566921c8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #34: <unknown function> + 0x150582 (0x5594ae6a7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55f67f96f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55824bff1007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5566921b98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #62: <unknown function> + 0x150582 (0x55f67f988582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #53: _PyObject_Call_Prepend + 0x69 (0x55824c002c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #45: <unknown function> + 0x150582 (0x5566921d4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: frame #63: PyObject_Call + 0xbc (0x55f67f988f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #54: <unknown function> + 0x211239 (0x55824c0c5239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5594ae68c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default7]:[rank7]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default4]:[rank4]: frame #46: PyObject_Call + 0xbc (0x5566921d4f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #55: PyObject_Call + 0x207 (0x55824c005067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5566921bb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5594ae693f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55824bfeb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #37: _PyObject_Call_Prepend + 0x69 (0x5594ae6a5c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #57: <unknown function> + 0x150582 (0x55824c004582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #38: <unknown function> + 0x211239 (0x5594ae768239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #48: <unknown function> + 0x150582 (0x5566921d4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #49: PyObject_Call + 0xbc (0x5566921d4f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #39: _PyObject_MakeTpCall + 0x26b (0x5594ae694a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55824bfe98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5594ae6903e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #59: <unknown function> + 0x150582 (0x55824c004582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5566921bb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #41: _PyFunction_Vectorcall + 0x6c (0x5594ae69ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5566921c8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #60: PyObject_Call + 0xbc (0x55824c004f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5594ae68bc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55824bfeb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #43: _PyFunction_Vectorcall + 0x6c (0x5594ae69ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5594ae68c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5566921c1007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #62: <unknown function> + 0x150582 (0x55824c004582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5566921d2c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #45: <unknown function> + 0x150582 (0x5594ae6a7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #54: <unknown function> + 0x211239 (0x556692295239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #46: PyObject_Call + 0xbc (0x5594ae6a7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #55: PyObject_Call + 0x207 (0x5566921d5067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5594ae68e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #48: <unknown function> + 0x150582 (0x5594ae6a7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: frame #63: PyObject_Call + 0xbc (0x55824c004f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5566921bb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default3]:[rank3]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default4]:[rank4]: frame #57: <unknown function> + 0x150582 (0x5566921d4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5566921b98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #59: <unknown function> + 0x150582 (0x5566921d4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #60: PyObject_Call + 0xbc (0x5566921d4f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5566921bb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #62: <unknown function> + 0x150582 (0x5566921d4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: frame #63: PyObject_Call + 0xbc (0x5566921d4f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default4]:[rank4]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
[default5]:[rank5]: frame #49: PyObject_Call + 0xbc (0x5594ae6a7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5594ae68e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #51: _PyFunction_Vectorcall + 0x6c (0x5594ae69ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5594ae694007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #53: _PyObject_Call_Prepend + 0x69 (0x5594ae6a5c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #54: <unknown function> + 0x211239 (0x5594ae768239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #55: PyObject_Call + 0x207 (0x5594ae6a8067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5594ae68e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #57: <unknown function> + 0x150582 (0x5594ae6a7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5594ae68c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #59: <unknown function> + 0x150582 (0x5594ae6a7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #60: PyObject_Call + 0xbc (0x5594ae6a7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5594ae68e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #62: <unknown function> + 0x150582 (0x5594ae6a7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: frame #63: PyObject_Call + 0xbc (0x5594ae6a7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) |
|
[default5]:[rank5]: . This may indicate a possible application crash on rank 0 or a network set up issue. |
|
W0703 22:53:56.688000 139799845013312 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1036265 closing signal SIGTERM |
|
W0703 22:53:56.688000 139799845013312 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1036266 closing signal SIGTERM |
|
W0703 22:53:56.689000 139799845013312 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1036267 closing signal SIGTERM |
|
W0703 22:53:56.689000 139799845013312 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1036268 closing signal SIGTERM |
|
W0703 22:53:56.690000 139799845013312 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1036269 closing signal SIGTERM |
|
W0703 22:53:56.691000 139799845013312 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1036270 closing signal SIGTERM |
|
E0703 22:53:57.803000 139799845013312 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1036263) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 |
|
Traceback (most recent call last): |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module> |
|
sys.exit(main()) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper |
|
return f(*args, **kwargs) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main |
|
run(args) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run |
|
elastic_launch( |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
|
return launch_agent(self._config, self._entrypoint, list(args)) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent |
|
raise ChildFailedError( |
|
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
============================================================ |
|
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED |
|
------------------------------------------------------------ |
|
Failures: |
|
[1]: |
|
time : 2024-07-03_22:53:56 |
|
host : ip-26-0-161-178.ec2.internal |
|
rank : 1 (local_rank: 1) |
|
exitcode : 1 (pid: 1036264) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
------------------------------------------------------------ |
|
Root Cause (first observed failure): |
|
[0]: |
|
time : 2024-07-03_22:53:56 |
|
host : ip-26-0-161-178.ec2.internal |
|
rank : 0 (local_rank: 0) |
|
exitcode : 1 (pid: 1036263) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
============================================================ |
|
srun: error: ip-26-0-161-178: task 0: Exited with exit code 1 |
|
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details. |
|
|