======================== START TIME: Wed Jul 3 02:58:45 UTC 2024 python3 version = Python 3.10.14 ======================== The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well. Token is valid (permission: write). Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token Login successful Already on 'bench_cluster' M examples/config_tiny_llama.py M examples/config_tiny_llama.yaml M examples/train_tiny_llama.sh M src/nanotron/models/llama.py M src/nanotron/trainer.py Your branch is up to date with 'origin/bench_cluster'. Job status: RUNNING W0703 02:58:48.295000 140253797914432 torch/distributed/run.py:757] W0703 02:58:48.295000 140253797914432 torch/distributed/run.py:757] ***************************************** W0703 02:58:48.295000 140253797914432 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 02:58:48.295000 140253797914432 torch/distributed/run.py:757] ***************************************** W0703 02:58:48.294000 139964713871168 torch/distributed/run.py:757] W0703 02:58:48.294000 139964713871168 torch/distributed/run.py:757] ***************************************** W0703 02:58:48.294000 139964713871168 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 02:58:48.294000 139964713871168 torch/distributed/run.py:757] ***************************************** W0703 02:58:48.295000 140152514582336 torch/distributed/run.py:757] W0703 02:58:48.295000 140152514582336 torch/distributed/run.py:757] ***************************************** W0703 02:58:48.295000 140152514582336 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 02:58:48.295000 140152514582336 torch/distributed/run.py:757] ***************************************** W0703 02:58:48.293000 140292847032128 torch/distributed/run.py:757] W0703 02:58:48.293000 140292847032128 torch/distributed/run.py:757] ***************************************** W0703 02:58:48.293000 140292847032128 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 02:58:48.293000 140292847032128 torch/distributed/run.py:757] ***************************************** W0703 02:58:48.301000 140334638495552 torch/distributed/run.py:757] W0703 02:58:48.301000 140334638495552 torch/distributed/run.py:757] ***************************************** W0703 02:58:48.301000 140334638495552 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 02:58:48.301000 140334638495552 torch/distributed/run.py:757] ***************************************** W0703 02:58:48.302000 140431831271232 torch/distributed/run.py:757] W0703 02:58:48.302000 140431831271232 torch/distributed/run.py:757] ***************************************** W0703 02:58:48.302000 140431831271232 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 02:58:48.302000 140431831271232 torch/distributed/run.py:757] ***************************************** W0703 02:58:48.307000 140571390818112 torch/distributed/run.py:757] W0703 02:58:48.307000 140571390818112 torch/distributed/run.py:757] ***************************************** W0703 02:58:48.307000 140571390818112 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 02:58:48.307000 140571390818112 torch/distributed/run.py:757] ***************************************** W0703 02:58:48.382000 140162037520192 torch/distributed/run.py:757] W0703 02:58:48.382000 140162037520192 torch/distributed/run.py:757] ***************************************** W0703 02:58:48.382000 140162037520192 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 02:58:48.382000 140162037520192 torch/distributed/run.py:757] ***************************************** [default0]:07/03/2024 02:59:08 [WARNING|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Vocab Size Padding] Padded vocab (size: 50257) with 15 dummy tokens (new size: 50272) [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Config: [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Config(general=GeneralArgs(project='bench_cluster', [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: run='%date_%jobid', [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: seed=42, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: step=None, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: consumed_train_samples=None, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: benchmark_csv_path=None, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: ignore_sanity_checks=True), [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: parallelism=ParallelismArgs(dp=1, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pp=4, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tp=16, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pp_engine=, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tp_mode=, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tp_linear_async_communication=False, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: expert_parallel_size=1), [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: eos_token_id=2, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hidden_act='silu', [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hidden_size=2048, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: initializer_range=0.02, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: intermediate_size=4096, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: is_llama_config=True, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: max_position_embeddings=4096, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_attention_heads=32, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_hidden_layers=24, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_key_value_heads=32, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pad_token_id=None, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pretraining_tp=1, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rms_norm_eps=1e-05, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rope_scaling=None, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rope_theta=10000.0, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tie_word_embeddings=True, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: use_cache=True, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: vocab_size=50272), [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: init_method=RandomInit(std=0.025), [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: dtype=torch.bfloat16, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: make_vocab_size_divisible_by=1, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: ddp_bucket_cap_mb=25), [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2', [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tokenizer_revision=None, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tokenizer_max_length=None), [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'), [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: checkpoint_interval=100000, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: save_initial_state=False, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: resume_checkpoint_path=None, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: checkpoints_path_is_shared_file_system=False), [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: logging=LoggingArgs(log_level='info', [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: log_level_replica='info', [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: iteration_step_info_interval=1), [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tokens=TokensArgs(sequence_length=4096, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: train_steps=20, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: micro_batch_size=128, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: batch_accumulation_per_replica=8, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: val_check_interval=-1, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: limit_val_batches=0, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: limit_test_batches=0), [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: adam_beta1=0.9, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: adam_beta2=0.95, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: torch_adam_is_fused=True, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: name='adamW'), [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: zero_stage=1, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: weight_decay=0.01, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: clip_grad=1.0, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: accumulate_grad_in_fp32=True, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lr_warmup_steps=1, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lr_warmup_style='linear', [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lr_decay_style='linear', [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lr_decay_steps=19, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lr_decay_starting_step=None, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: min_decay_lr=1e-05)), [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: data_stages=[DatasetStageArgs(name='Training Stage', [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: start_training_step=1, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories', [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hf_dataset_splits='train', [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hf_dataset_config_name=None, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: dataset_processing_num_proc_per_process=64, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: dataset_overwrite_cache=False, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: text_column_name='text'), [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: seed=42, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_loading_workers=0))], [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/64_GPUS/dp-1_tp-16_pp-4_mbz-128')), [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: lighteval=None) [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Model Config: [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: LlamaConfig(bos_token_id=1, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: eos_token_id=2, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hidden_act='silu', [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: hidden_size=2048, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: initializer_range=0.02, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: intermediate_size=4096, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: is_llama_config=True, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: max_position_embeddings=4096, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_attention_heads=32, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_hidden_layers=24, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: num_key_value_heads=32, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pad_token_id=None, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: pretraining_tp=1, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rms_norm_eps=1e-05, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rope_scaling=None, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: rope_theta=10000.0, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: tie_word_embeddings=True, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: use_cache=True, [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: vocab_size=50272) [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Building model.. [default0]:07/03/2024 02:59:08 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Setting PP block ranks... [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=8|ip-26-0-169-247]: Local number of parameters: 15.8M (30.05MiB) [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=8|ip-26-0-169-247]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=8|ip-26-0-169-247]: No checkpoint path provided. [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=13|ip-26-0-169-247]: Local number of parameters: 15.8M (30.05MiB) [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=13|ip-26-0-169-247]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=13|ip-26-0-169-247]: No checkpoint path provided. [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=10|ip-26-0-169-247]: Local number of parameters: 15.8M (30.05MiB) [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=10|ip-26-0-169-247]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=10|ip-26-0-169-247]: No checkpoint path provided. [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=9|ip-26-0-169-247]: Local number of parameters: 15.8M (30.05MiB) [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=9|ip-26-0-169-247]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=9|ip-26-0-169-247]: No checkpoint path provided. [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=10|ip-26-0-165-24]: Local number of parameters: 18.4M (35.05MiB) [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=10|ip-26-0-165-24]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=10|ip-26-0-165-24]: No checkpoint path provided. [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=11|ip-26-0-169-247]: Local number of parameters: 15.8M (30.05MiB) [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=11|ip-26-0-169-247]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=11|ip-26-0-169-247]: No checkpoint path provided. [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=12|ip-26-0-169-247]: Local number of parameters: 15.8M (30.05MiB) [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=12|ip-26-0-169-247]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=12|ip-26-0-169-247]: No checkpoint path provided. [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=15|ip-26-0-169-247]: Local number of parameters: 15.8M (30.05MiB) [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=15|ip-26-0-169-247]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=15|ip-26-0-169-247]: No checkpoint path provided. [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=1|ip-26-0-173-246]: Local number of parameters: 16.9M (32.31MiB) [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=1|ip-26-0-173-246]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=1|ip-26-0-173-246]: No checkpoint path provided. [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=2|ip-26-0-173-246]: Local number of parameters: 16.9M (32.31MiB) [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=2|ip-26-0-173-246]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=2|ip-26-0-173-246]: No checkpoint path provided. [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=8|ip-26-0-163-147]: Local number of parameters: 24.8M (47.33MiB) [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=8|ip-26-0-163-147]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=8|ip-26-0-163-147]: No checkpoint path provided. [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Total number of parameters: 1.21G (2315.81MiB) [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Local number of parameters: 24.8M (47.33MiB) [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=3|ip-26-0-162-233]: Local number of parameters: 24.8M (47.33MiB) [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=3|ip-26-0-162-233]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=4|ip-26-0-162-233]: Local number of parameters: 24.8M (47.33MiB) [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=4|ip-26-0-162-233]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=13|ip-26-0-163-147]: Local number of parameters: 24.8M (47.33MiB) [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=13|ip-26-0-163-147]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=13|ip-26-0-163-147]: No checkpoint path provided. [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=3|ip-26-0-162-233]: No checkpoint path provided. [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=4|ip-26-0-162-233]: No checkpoint path provided. [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: No checkpoint path provided. [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Parametrizing model parameters using StandardParametrizator [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=6|ip-26-0-173-246]: Local number of parameters: 16.9M (32.31MiB) [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=6|ip-26-0-173-246]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=3|ip-26-0-169-139]: Local number of parameters: 15.8M (30.05MiB) [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=3|ip-26-0-169-139]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=3|ip-26-0-169-139]: No checkpoint path provided. [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=5|ip-26-0-169-139]: Local number of parameters: 15.8M (30.05MiB) [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=5|ip-26-0-169-139]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=5|ip-26-0-169-139]: No checkpoint path provided. [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=6|ip-26-0-173-246]: No checkpoint path provided. [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=5|ip-26-0-162-233]: Local number of parameters: 24.8M (47.33MiB) [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=10|ip-26-0-163-147]: Local number of parameters: 24.8M (47.33MiB) [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=1|ip-26-0-162-233]: Local number of parameters: 24.8M (47.33MiB) [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=1|ip-26-0-162-233]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=5|ip-26-0-162-233]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=10|ip-26-0-163-147]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=10|ip-26-0-163-147]: No checkpoint path provided. [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=5|ip-26-0-162-233]: No checkpoint path provided. [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=1|ip-26-0-162-233]: No checkpoint path provided. [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=12|ip-26-0-163-147]: Local number of parameters: 24.8M (47.33MiB) [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=12|ip-26-0-163-147]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=12|ip-26-0-163-147]: No checkpoint path provided. [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=6|ip-26-0-162-233]: Local number of parameters: 24.8M (47.33MiB) [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=6|ip-26-0-162-233]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=6|ip-26-0-162-233]: No checkpoint path provided. [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=7|ip-26-0-162-233]: Local number of parameters: 24.8M (47.33MiB) [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=7|ip-26-0-162-233]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=7|ip-26-0-162-233]: No checkpoint path provided. [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=14|ip-26-0-169-247]: Local number of parameters: 15.8M (30.05MiB) [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=14|ip-26-0-169-247]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=14|ip-26-0-174-36]: Local number of parameters: 16.9M (32.31MiB) [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=15|ip-26-0-174-36]: Local number of parameters: 16.9M (32.31MiB) [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=15|ip-26-0-174-36]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=15|ip-26-0-174-36]: No checkpoint path provided. [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=2|ip-26-0-169-139]: Local number of parameters: 15.8M (30.05MiB) [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=2|ip-26-0-169-139]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=2|ip-26-0-169-139]: No checkpoint path provided. [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=14|ip-26-0-169-247]: No checkpoint path provided. [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=8|ip-26-0-165-24]: Local number of parameters: 18.4M (35.05MiB) [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=14|ip-26-0-174-36]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=14|ip-26-0-174-36]: No checkpoint path provided. [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=8|ip-26-0-165-24]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=8|ip-26-0-165-24]: No checkpoint path provided. [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=13|ip-26-0-174-36]: Local number of parameters: 16.9M (32.31MiB) [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=15|ip-26-0-163-147]: Local number of parameters: 24.8M (47.33MiB) [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=13|ip-26-0-174-36]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=13|ip-26-0-174-36]: No checkpoint path provided. [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=8|ip-26-0-174-36]: Local number of parameters: 16.9M (32.31MiB) [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=8|ip-26-0-174-36]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=8|ip-26-0-174-36]: No checkpoint path provided. [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=9|ip-26-0-163-147]: Local number of parameters: 24.8M (47.33MiB) [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=9|ip-26-0-163-147]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=9|ip-26-0-163-147]: No checkpoint path provided. [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=14|ip-26-0-163-147]: Local number of parameters: 24.8M (47.33MiB) [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=14|ip-26-0-163-147]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=14|ip-26-0-163-147]: No checkpoint path provided. [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=15|ip-26-0-163-147]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=15|ip-26-0-163-147]: No checkpoint path provided. [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=4|ip-26-0-169-139]: Local number of parameters: 15.8M (30.05MiB) [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=4|ip-26-0-169-139]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=4|ip-26-0-169-139]: No checkpoint path provided. [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=9|ip-26-0-165-24]: Local number of parameters: 18.4M (35.05MiB) [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=9|ip-26-0-165-24]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=9|ip-26-0-165-24]: No checkpoint path provided. [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=15|ip-26-0-165-24]: Local number of parameters: 18.4M (35.05MiB) [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=15|ip-26-0-165-24]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=15|ip-26-0-165-24]: No checkpoint path provided. [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=7|ip-26-0-173-246]: Local number of parameters: 16.9M (32.31MiB) [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=7|ip-26-0-173-246]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=7|ip-26-0-173-246]: No checkpoint path provided. [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=0|ip-26-0-173-246]: Local number of parameters: 16.9M (32.31MiB) [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=0|ip-26-0-173-246]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=0|ip-26-0-173-246]: No checkpoint path provided. [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=2|ip-26-0-162-233]: Local number of parameters: 24.8M (47.33MiB) [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=2|ip-26-0-162-233]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=3|ip-26-0-173-246]: Local number of parameters: 16.9M (32.31MiB) [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=2|ip-26-0-162-233]: No checkpoint path provided. [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=1|ip-26-0-169-139]: Local number of parameters: 15.8M (30.05MiB) [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=3|ip-26-0-173-246]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=3|ip-26-0-173-246]: No checkpoint path provided. [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=11|ip-26-0-163-147]: Local number of parameters: 24.8M (47.33MiB) [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=1|ip-26-0-169-139]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=1|ip-26-0-169-139]: No checkpoint path provided. [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=5|ip-26-0-173-246]: Local number of parameters: 16.9M (32.31MiB) [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=11|ip-26-0-163-147]: [After model building] Memory usage: 55.07MiB. Peak allocated: 57.10MiB Peak reserved: 74.00MiB [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=0|TP=11|ip-26-0-163-147]: No checkpoint path provided. [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=5|ip-26-0-173-246]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=5|ip-26-0-173-246]: No checkpoint path provided. [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=7|ip-26-0-169-139]: Local number of parameters: 15.8M (30.05MiB) [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=7|ip-26-0-169-139]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=4|ip-26-0-173-246]: Local number of parameters: 16.9M (32.31MiB) [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=7|ip-26-0-169-139]: No checkpoint path provided. [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=4|ip-26-0-173-246]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=4|ip-26-0-173-246]: No checkpoint path provided. [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=0|ip-26-0-164-207]: Local number of parameters: 18.4M (35.05MiB) [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=5|ip-26-0-164-207]: Local number of parameters: 18.4M (35.05MiB) [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=5|ip-26-0-164-207]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=5|ip-26-0-164-207]: No checkpoint path provided. [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=0|ip-26-0-164-207]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=0|ip-26-0-164-207]: No checkpoint path provided. [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=4|ip-26-0-164-207]: Local number of parameters: 18.4M (35.05MiB) [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=4|ip-26-0-164-207]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=4|ip-26-0-164-207]: No checkpoint path provided. [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=1|ip-26-0-164-207]: Local number of parameters: 18.4M (35.05MiB) [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=1|ip-26-0-164-207]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=1|ip-26-0-164-207]: No checkpoint path provided. [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=9|ip-26-0-174-36]: Local number of parameters: 16.9M (32.31MiB) [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=9|ip-26-0-174-36]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=7|ip-26-0-164-207]: Local number of parameters: 18.4M (35.05MiB) [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=7|ip-26-0-164-207]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default1]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=9|ip-26-0-174-36]: No checkpoint path provided. [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=11|ip-26-0-174-36]: Local number of parameters: 16.9M (32.31MiB) [default7]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=7|ip-26-0-164-207]: No checkpoint path provided. [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=11|ip-26-0-165-24]: Local number of parameters: 18.4M (35.05MiB) [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=11|ip-26-0-165-24]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=11|ip-26-0-165-24]: No checkpoint path provided. [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=11|ip-26-0-174-36]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=11|ip-26-0-174-36]: No checkpoint path provided. [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=13|ip-26-0-165-24]: Local number of parameters: 18.4M (35.05MiB) [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=13|ip-26-0-165-24]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=12|ip-26-0-165-24]: Local number of parameters: 18.4M (35.05MiB) [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=12|ip-26-0-165-24]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default5]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=13|ip-26-0-165-24]: No checkpoint path provided. [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=12|ip-26-0-165-24]: No checkpoint path provided. [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=10|ip-26-0-174-36]: Local number of parameters: 16.9M (32.31MiB) [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=10|ip-26-0-174-36]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=10|ip-26-0-174-36]: No checkpoint path provided. [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=6|ip-26-0-169-139]: Local number of parameters: 15.8M (30.05MiB) [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=6|ip-26-0-169-139]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=6|ip-26-0-169-139]: No checkpoint path provided. [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=14|ip-26-0-165-24]: Local number of parameters: 18.4M (35.05MiB) [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=14|ip-26-0-165-24]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=14|ip-26-0-165-24]: No checkpoint path provided. [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=0|ip-26-0-169-139]: Local number of parameters: 15.8M (30.05MiB) [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=0|ip-26-0-169-139]: [After model building] Memory usage: 37.06MiB. Peak allocated: 39.09MiB Peak reserved: 58.00MiB [default0]:07/03/2024 02:59:25 [INFO|DP=0|PP=2|TP=0|ip-26-0-169-139]: No checkpoint path provided. [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=12|ip-26-0-174-36]: Local number of parameters: 16.9M (32.31MiB) [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=12|ip-26-0-174-36]: [After model building] Memory usage: 36.32MiB. Peak allocated: 38.35MiB Peak reserved: 48.00MiB [default4]:07/03/2024 02:59:25 [INFO|DP=0|PP=3|TP=12|ip-26-0-174-36]: No checkpoint path provided. [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=2|ip-26-0-164-207]: Local number of parameters: 18.4M (35.05MiB) [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=2|ip-26-0-164-207]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default2]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=2|ip-26-0-164-207]: No checkpoint path provided. [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=6|ip-26-0-164-207]: Local number of parameters: 18.4M (35.05MiB) [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=6|ip-26-0-164-207]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default6]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=6|ip-26-0-164-207]: No checkpoint path provided. [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=3|ip-26-0-164-207]: Local number of parameters: 18.4M (35.05MiB) [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=3|ip-26-0-164-207]: [After model building] Memory usage: 43.07MiB. Peak allocated: 45.10MiB Peak reserved: 60.00MiB [default3]:07/03/2024 02:59:25 [INFO|DP=0|PP=1|TP=3|ip-26-0-164-207]: No checkpoint path provided. [default0]:07/03/2024 02:59:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Optimizer Building] Using LearningRateForSP as learning rate [default0]:07/03/2024 02:59:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [ZeRO sharding] Size of optimizer params per rank: [default0]:07/03/2024 02:59:27 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [ZeRO sharding] DP Rank 0 has 24.8M out of 24.8M (100.00%) params' optimizer states [default0]:07/03/2024 02:59:29 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples [default0]:07/03/2024 02:59:29 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Using `datasets` library [default0]:07/03/2024 02:59:29 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4') [default0]:07/03/2024 02:59:29 [WARNING|DP=0|PP=0|TP=0|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 02:59:31 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Training Plan] There are 1 training stages [default0]:07/03/2024 02:59:31 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Stage Training Stage] start from step 1 [default0]:07/03/2024 02:59:31 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [default0]:07/03/2024 02:59:31 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: [Start training] datetime: 2024-07-03 02:59:31.221273 | mbs: 128 | grad_accum: 8 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0 [default0]:07/03/2024 02:59:31 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps [default0]:07/03/2024 02:59:31 [INFO|DP=0|PP=0|TP=0|ip-26-0-162-233]: Memory usage: 244.38MiB. Peak allocated 244.38MiB. Peak reserved: 266.00MiB [default3]:07/03/2024 02:59:31 [WARNING|DP=0|PP=0|TP=3|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 02:59:31 [WARNING|DP=0|PP=2|TP=8|ip-26-0-169-247]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 02:59:31 [WARNING|DP=0|PP=2|TP=13|ip-26-0-169-247]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 02:59:31 [WARNING|DP=0|PP=2|TP=10|ip-26-0-169-247]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 02:59:31 [WARNING|DP=0|PP=2|TP=9|ip-26-0-169-247]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 02:59:31 [WARNING|DP=0|PP=1|TP=10|ip-26-0-165-24]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 02:59:31 [WARNING|DP=0|PP=2|TP=12|ip-26-0-169-247]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 02:59:31 [WARNING|DP=0|PP=2|TP=11|ip-26-0-169-247]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 02:59:31 [WARNING|DP=0|PP=2|TP=15|ip-26-0-169-247]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 02:59:31 [WARNING|DP=0|PP=3|TP=2|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 02:59:31 [WARNING|DP=0|PP=0|TP=13|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 02:59:31 [WARNING|DP=0|PP=0|TP=8|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 02:59:31 [WARNING|DP=0|PP=2|TP=3|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 02:59:31 [WARNING|DP=0|PP=2|TP=5|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 02:59:31 [WARNING|DP=0|PP=3|TP=6|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 02:59:31 [WARNING|DP=0|PP=0|TP=4|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 02:59:31 [WARNING|DP=0|PP=0|TP=10|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 02:59:31 [WARNING|DP=0|PP=0|TP=1|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 02:59:31 [WARNING|DP=0|PP=3|TP=15|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 02:59:31 [WARNING|DP=0|PP=3|TP=14|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 02:59:31 [WARNING|DP=0|PP=0|TP=5|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 02:59:31 [WARNING|DP=0|PP=0|TP=6|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 02:59:31 [WARNING|DP=0|PP=0|TP=7|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 02:59:31 [WARNING|DP=0|PP=3|TP=8|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 02:59:31 [WARNING|DP=0|PP=3|TP=13|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 02:59:31 [WARNING|DP=0|PP=2|TP=14|ip-26-0-169-247]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 02:59:31 [WARNING|DP=0|PP=1|TP=8|ip-26-0-165-24]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 02:59:31 [WARNING|DP=0|PP=2|TP=2|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 02:59:31 [WARNING|DP=0|PP=0|TP=9|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 02:59:31 [WARNING|DP=0|PP=0|TP=14|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 02:59:31 [WARNING|DP=0|PP=0|TP=15|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 02:59:31 [WARNING|DP=0|PP=3|TP=0|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 02:59:31 [WARNING|DP=0|PP=3|TP=3|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 02:59:31 [WARNING|DP=0|PP=3|TP=7|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 02:59:31 [WARNING|DP=0|PP=1|TP=15|ip-26-0-165-24]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 02:59:31 [WARNING|DP=0|PP=0|TP=2|ip-26-0-162-233]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 02:59:31 [WARNING|DP=0|PP=2|TP=1|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 02:59:31 [WARNING|DP=0|PP=3|TP=4|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 02:59:31 [WARNING|DP=0|PP=0|TP=11|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 02:59:31 [WARNING|DP=0|PP=2|TP=4|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 02:59:31 [WARNING|DP=0|PP=3|TP=5|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 02:59:31 [WARNING|DP=0|PP=1|TP=12|ip-26-0-165-24]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 02:59:31 [WARNING|DP=0|PP=1|TP=13|ip-26-0-165-24]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 02:59:31 [WARNING|DP=0|PP=3|TP=9|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty. [default5]:07/03/2024 02:59:31 [WARNING|DP=0|PP=1|TP=5|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 02:59:31 [WARNING|DP=0|PP=3|TP=11|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 02:59:31 [WARNING|DP=0|PP=1|TP=1|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 02:59:31 [WARNING|DP=0|PP=1|TP=0|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 02:59:31 [WARNING|DP=0|PP=1|TP=4|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 02:59:31 [WARNING|DP=0|PP=3|TP=10|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 02:59:31 [WARNING|DP=0|PP=2|TP=6|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 02:59:31 [WARNING|DP=0|PP=1|TP=7|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/03/2024 02:59:31 [WARNING|DP=0|PP=2|TP=0|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 02:59:31 [WARNING|DP=0|PP=1|TP=6|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 02:59:31 [WARNING|DP=0|PP=1|TP=3|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default5]:Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 02:59:31 [WARNING|DP=0|PP=3|TP=1|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/03/2024 02:59:31 [WARNING|DP=0|PP=1|TP=9|ip-26-0-165-24]: Repo card metadata block was not found. Setting CardData to empty. [default3]:07/03/2024 02:59:31 [WARNING|DP=0|PP=1|TP=11|ip-26-0-165-24]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 02:59:31 [WARNING|DP=0|PP=3|TP=12|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default7]:Repo card metadata block was not found. Setting CardData to empty. [default7]:07/03/2024 02:59:31 [WARNING|DP=0|PP=2|TP=7|ip-26-0-169-139]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/03/2024 02:59:31 [WARNING|DP=0|PP=1|TP=14|ip-26-0-165-24]: Repo card metadata block was not found. Setting CardData to empty. [default2]:07/03/2024 02:59:31 [WARNING|DP=0|PP=1|TP=2|ip-26-0-164-207]: Repo card metadata block was not found. Setting CardData to empty. [default4]:07/03/2024 02:59:36 [WARNING|DP=0|PP=0|TP=12|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default7]:[rank15]: Traceback (most recent call last): [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank15]: trainer.train(dataloader) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank15]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank15]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank15]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank15]: output = model(**micro_batch) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default7]:[rank15]: sharded_logits = self.model( [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank15]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank15]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default7]:[rank15]: output = self.pp_block(**new_kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default7]:[rank15]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: Traceback (most recent call last): [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward [default7]:[rank15]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank15]: return self._call_impl(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank15]: return forward_call(*args, **kwargs) [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/ten[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:[rank7]: Traceback (most recent call last): [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank6]: trainer.train(dataloader) sor_parallel/nn.py", line 159, in forward [default7]:[rank15]: return row_linear( [default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default7]:[rank15]: out = F.linear(input, weight, bias) [default7]:[rank15]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.54 GiB is free. Including non-PyTorch memory, this process has 77.78 GiB memory in use. Of the allocated memory 67.96 GiB is allocated by PyTorch, and 58.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default7]:[rank7]: trainer.train(dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank7]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:[rank7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank7]: output = model(**micro_batch) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank7]: sharded_logits = self.model( [default6]:[rank6]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank6]: output = model(**micro_batch) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]:[rank7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank6]: return forward_call(*args, **kwargs) [default7]:[rank7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank6]: sharded_logits = self.model( [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:[rank7]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default7]:[rank7]: output = self.pp_block(**new_kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default7]:[rank7]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: Traceback (most recent call last): [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank14]: trainer.train(dataloader) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default6]:[rank14]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank14]: outputs = self.pipeline_engine.train_batch_iter( [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank14]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanot[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default6]:[rank6]: output = self.pp_block(**new_kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return self._call_impl(*args, **kwargs) ron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank14]: output = model(**micro_batch) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default6]:[rank14]: sharded_logits = self.model( [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotro[default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl n/src/nanotron/models/llama.py", line 764, in forward [default6]:[rank14]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:[rank14]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default6]:[rank14]: output = self.pp_block(**new_kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default6]:[rank14]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward [default6]:[rank14]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank14]: return self._call_impl(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank14]: return forward_call(*args, **kwargs) [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default6]:[rank14]: return row_linear( [default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default6]:[rank14]: out = F.linear(input, weight, bias) [default6]:[rank14]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.45 GiB is free. Including non-PyTorch memory, this process has 77.87 GiB memory in use. Of the allocated memory 67.96 GiB is allocated by PyTorch, and 58.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default0]:[rank8]: Traceback (most recent call last): [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank8]: trainer.train(dataloader) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank8]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank8]: outputs = self.pipeline_engine.train_batch_iter( [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default0]:[rank8]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank8]: output = model(**micro_batch) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default0]:[rank8]: sharded_logits = self.model( [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank8]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank8]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: Traceback (most recent call last): [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: trainer.train(dataloader) [default0]:[rank8]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]:[rank9]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank8]: output = self.pp_block(**new_kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default1]:[rank9]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank9]: output = model(**micro_batch) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank9]: sharded_logits = self.model( [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:[rank9]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank9]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]:[rank9]: output = self.pp_block(**new_kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward [default0]:[rank8]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default1]:[rank9]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank8]: return self._call_impl(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank8]: return forward_call(*args, **kwargs) [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default0]:[rank8]: return row_linear( [default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default0]:[rank8]: out = F.linear(input, weight, bias) [default0]:[rank8]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward [default1]:[rank9]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank9]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: Traceback (most recent call last): [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:[rank9]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: Traceback (most recent call last): [default5]:[rank13]: Traceback (most recent call last): [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank11]: Traceback (most recent call last): [default1]:[rank9]: return forward_call(*args, **kwargs) [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default1]:[rank9]: return row_linear( [default4]:[rank12]: trainer.train(dataloader) [default2]:[rank10]: trainer.train(dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default1]:[rank9]: out = F.linear(input, weight, bias) [default2]:[rank10]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank10]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank13]: trainer.train(dataloader) [default4]:[rank12]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank9]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.54 GiB is free. Including non-PyTorch memory, this process has 77.78 GiB memory in use. Of the allocated memory 67.96 GiB is allocated by PyTorch, and 58.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank13]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank11]: trainer.train(dataloader) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank6]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default7]:[rank7]: return forward_call(*args, **kwargs) [default2]:[rank10]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default7]:[rank7]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank13]: outputs = self.pipeline_engine.train_batch_iter( [default7]:[rank7]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default6]:[rank6]: return forward_call(*args, **kwargs) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default7]:[rank7]: return forward_call(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default7]:[rank7]: return row_linear( [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default7]:[rank7]: out = differentiable_reduce_scatter_sum(out, group=group) [default2]:[rank10]: output = model(**micro_batch) [default6]:[rank6]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default4]:[rank12]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default6]:[rank6]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default7]:[rank7]: return DifferentiableReduceScatterSum.apply(tensor, group) [default3]:[rank11]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default7]:[rank7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default5]:[rank13]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank7]: return super().apply(*args, **kwargs) # type: ignore[misc] [default6]:[rank6]: return forward_call(*args, **kwargs) [default4]:[rank12]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default7]:[rank7]: sharded_tensor = torch.empty( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default3]:[rank11]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank6]: return row_linear( [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:[rank7]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU  has a total capacity of 79.33 GiB of which 39.94 MiB is free. Including non-PyTorch memory, this process has 79.28 GiB memory in use. Of the allocated memory 69.96 GiB is allocated by PyTorch, and 58.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default6]:[rank6]: out = differentiable_reduce_scatter_sum(out, group=group) [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default6]:[rank6]: return DifferentiableReduceScatterSum.apply(tensor, group) [default6]:[rank6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default6]:[rank6]: return super().apply(*args, **kwargs) # type: ignore[misc] [default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default6]:[rank6]: sharded_tensor = torch.empty( [default1]:[rank1]: Traceback (most recent call last): [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:[rank6]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU  has a total capacity of 79.33 GiB of which 127.94 MiB is free. Including non-PyTorch memory, this process has 79.19 GiB memory in use. Of the allocated memory 69.96 GiB is allocated by PyTorch, and 58.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default1]:[rank1]: trainer.train(dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default1]:[rank1]: outputs = self.pipeline_engine.train_batch_iter( [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank11]: output = model(**micro_batch) [default1]:[rank1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:[rank1]: output = model(**micro_batch) [default5]:[rank13]: output = model(**micro_batch) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank1]: sharded_logits = self.model( [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: output = model(**micro_batch) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default1]:[rank1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:[rank1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default2]:[rank10]: sharded_logits = self.model( [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]:[rank1]: output = self.pp_block(**new_kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward [default1]:[rank1]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default1]:[rank1]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default1]:[rank1]: return row_linear( [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default1]:[rank1]: out = differentiable_reduce_scatter_sum(out, group=group) [default3]:[rank11]: return forward_call(*args, **kwargs) [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default1]:[rank1]: return DifferentiableReduceScatterSum.apply(tensor, group) [default1]:[rank1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default1]:[rank1]: return super().apply(*args, **kwargs) # type: ignore[misc] [default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default1]:[rank1]: sharded_tensor = torch.empty( [default1]:[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU  has a total capacity of 79.33 GiB of which 39.94 MiB is free. Including non-PyTorch memory, this process has 79.28 GiB memory in use. Of the allocated memory 69.96 GiB is allocated by PyTorch, and 58.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default2]:[rank2]: Traceback (most recent call last): [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:[rank2]: trainer.train(dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default2]:[rank2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default2]:[rank2]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default2]:[rank2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank2]: output = model(**micro_batch) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank2]: sharded_logits = self.model( [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nano[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward tron/models/llama.py", line 764, in forward [default2]:[rank2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]:[rank2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default2]:[rank2]: output = self.pp_block(**new_kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in [default3]:[rank11]: sharded_logits = self.model( forward [default2]:[rank2]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward [default2]:[rank2]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default2]:[rank10]: return forward_call(*args, **kwargs) [default2]:[rank2]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank2]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default2]:[rank2]: return row_linear( [default5]:[rank13]: return forward_call(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default2]:[rank2]: out = differentiable_reduce_scatter_sum(out, group=group) [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default2]:[rank2]: return DifferentiableReduceScatterSum.apply(tensor, group) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default2]:[rank2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default2]:[rank2]: return super().apply(*args, **kwargs) # type: ignore[misc] [default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default2]:[rank2]: sharded_tensor = torch.empty( [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default2]:[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU  has a total capacity of 79.33 GiB of which 127.94 MiB is free. Including non-PyTorch memory, this process has 79.19 GiB memory in use. Of the allocated memory 69.96 GiB is allocated by PyTorch, and 58.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank0]: Traceback (most recent call last): [default5]:[rank5]: Traceback (most recent call last): [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:[rank5]: trainer.train(dataloader) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank12]: sharded_logits = self.model( [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:[rank11]: return forward_call(*args, **kwargs) [default0]:[rank0]: trainer.train(dataloader) [default2]:[rank10]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank5]: outputs = self.pipeline_engine.train_batch_iter( [default5]:[rank13]: sharded_logits = self.model( [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default5]:[rank5]: output = model(**micro_batch) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default5]:[rank5]: sharded_logits = self.model( [default3]:[rank11]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank11]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank0]: outputs = self.pipeline_engine.train_batch_iter( [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: output = model(**micro_batch) [default5]:[rank5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank13]: return forward_call(*args, **kwargs) [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: return forward_call(*args, **kwargs) [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:[rank13]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank0]: return forward_call(*args, **kwargs) [default2]:[rank10]: output = self.pp_block(**new_kwargs) [default5]:[rank5]: return forward_call(*args, **kwargs) [default3]:[rank11]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default5]:[rank5]: output = self.pp_block(**new_kwargs) [default0]:[rank0]: sharded_logits = self.model( [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank12]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:[rank0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:[rank11]: output = self.pp_block(**new_kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:[rank0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:[rank13]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: return forward_call(*args, **kwargs) [default4]:[rank12]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return forward_call(*args, **kwargs) [default2]:[rank10]: return forward_call(*args, **kwargs) [default0]:[rank0]: return forward_call(*args, **kwargs) [default5]:[rank13]: output = self.pp_block(**new_kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default0]:[rank0]: output = self.pp_block(**new_kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward [default3]:[rank11]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default5]:[rank5]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default2]:[rank10]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default0]:[rank0]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default5]:[rank5]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default5]:[rank5]: return forward_call(*args, **kwargs) [default4]:[rank12]: output = self.pp_block(**new_kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank5]: return row_linear( [default0]:[rank0]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default2]:[rank10]: return forward_call(*args, **kwargs) [default3]:[rank11]: return forward_call(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default0]:[rank0]: return forward_call(*args, **kwargs) [default5]:[rank13]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward [default5]:[rank5]: out = differentiable_reduce_scatter_sum(out, group=group) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: return forward_call(*args, **kwargs) [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default0]:[rank0]: return row_linear( [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default2]:[rank10]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default5]:[rank5]: return DifferentiableReduceScatterSum.apply(tensor, group) [default3]:[rank11]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default5]:[rank5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default5]:[rank5]: return super().apply(*args, **kwargs) # type: ignore[misc] [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default0]:[rank0]: out = differentiable_reduce_scatter_sum(out, group=group) [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default5]:[rank5]: sharded_tensor = torch.empty( [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default4]:[rank12]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default0]:[rank0]: return DifferentiableReduceScatterSum.apply(tensor, group) [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank5]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU  has a total capacity of 79.33 GiB of which 39.94 MiB is free. Including non-PyTorch memory, this process has 79.28 GiB memory in use. Of the allocated memory 69.96 GiB is allocated by PyTorch, and 58.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default0]:[rank0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default0]:[rank0]: return super().apply(*args, **kwargs) # type: ignore[misc] [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default0]:[rank0]: sharded_tensor = torch.empty( [default0]:[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU [default3]:[rank11]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: Traceback (most recent call last): [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: trainer.train(dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default4]:[rank4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: outputs = self.pipeline_engine.train_batch_iter( [default3]:[rank3]: Traceback (most recent call last): [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:[rank11]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default2]:[rank10]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: trainer.train(dataloader) [default5]:[rank13]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 429, in train [default3]:[rank3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 462, in training_step [default3]:[rank3]: outputs = self.pipeline_engine.train_batch_iter( [default2]:[rank10]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default4]:[rank4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:[rank13]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default4]:[rank4]: output = model(**micro_batch) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: return forward_call(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 252, in train_batch_iter [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:[rank12]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default3]:[rank11]: return row_linear( [default3]:[rank3]: output = model(**micro_batch) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default5]:[rank13]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default3]:[rank3]: sharded_logits = self.model( [default3]:[rank11]: out = F.linear(input, weight, bias) [default4]:[rank4]: return forward_call(*args, **kwargs) [default5]:[rank13]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 891, in forward [default4]:[rank4]: sharded_logits = self.model( [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default3]:[rank11]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.54 GiB is free. Including non-PyTorch memory, this process has 77.78 GiB memory in use. Of the allocated memory 67.96 GiB is allocated by PyTorch, and 58.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank12]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank10]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:[rank13]: return row_linear( [default4]:[rank12]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default2]:[rank10]: return row_linear( [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default5]:[rank13]: out = F.linear(input, weight, bias) [default4]:[rank4]: output = self.pp_block(**new_kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank12]: return row_linear( [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default5]:[rank13]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.54 GiB is free. Including non-PyTorch memory, this process has 77.78 GiB memory in use. Of the allocated memory 67.96 GiB is allocated by PyTorch, and 58.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default4]:[rank4]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward [default4]:[rank12]: out = F.linear(input, weight, bias) [default3]:[rank3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default2]:[rank10]: out = F.linear(input, weight, bias) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:[rank3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:[rank12]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.45 GiB is free. Including non-PyTorch memory, this process has 77.87 GiB memory in use. Of the allocated memory 67.96 GiB is allocated by PyTorch, and 58.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default2]:[rank10]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU  has a total capacity of 79.33 GiB of which 1.45 GiB is free. Including non-PyTorch memory, this process has 77.87 GiB memory in use. Of the allocated memory 67.96 GiB is allocated by PyTorch, and 58.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default4]:[rank4]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default4]:[rank4]: return forward_call(*args, **kwargs) [default3]:[rank3]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default4]:[rank4]: return row_linear( [default3]:[rank3]: output = self.pp_block(**new_kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default4]:[rank4]: out = differentiable_reduce_scatter_sum(out, group=group) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default4]:[rank4]: return DifferentiableReduceScatterSum.apply(tensor, group) [default4]:[rank4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default4]:[rank4]: return super().apply(*args, **kwargs) # type: ignore[misc] [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default4]:[rank4]: sharded_tensor = torch.empty( [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 637, in forward [default3]:[rank3]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default4]:[rank4]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU  has a total capacity of 79.33 GiB of which 127.94 MiB is free. Including non-PyTorch memory, this process has 79.19 GiB memory in use. Of the allocated memory 69.96 GiB is allocated by PyTorch, and 58.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 172, in forward [default3]:[rank3]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [default3]:[rank3]: return self._call_impl(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [default3]:[rank3]: return forward_call(*args, **kwargs) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default3]:[rank3]: return row_linear( [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 479, in row_linear [default3]:[rank3]: out = differentiable_reduce_scatter_sum(out, group=group) [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 145, in differentiable_reduce_scatter_sum [default3]:[rank3]: return DifferentiableReduceScatterSum.apply(tensor, group) [default3]:[rank3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply [default3]:[rank3]: return super().apply(*args, **kwargs) # type: ignore[misc] [default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/distributed_differentiable_primitives.py", line 111, in forward [default3]:[rank3]: sharded_tensor = torch.empty( [default3]:[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU  has a total capacity of 79.33 GiB of which 39.94 MiB is free. Including non-PyTorch memory, this process has 79.28 GiB memory in use. Of the allocated memory 69.96 GiB is allocated by PyTorch, and 58.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default5]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default3]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default4]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default4]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default4]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default7]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default7]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default0]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default3]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default6]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default6]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default2]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default6]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default6]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default2]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default2]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default7]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default7]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default1]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default5]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default5]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [default1]:/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [default1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass W0703 02:59:54.677000 140334638495552 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1657033 closing signal SIGTERM W0703 02:59:54.678000 140334638495552 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1657035 closing signal SIGTERM E0703 02:59:55.497000 140334638495552 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1657032) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_02:59:54 host : ip-26-0-162-233.ec2.internal rank : 2 (local_rank: 2) exitcode : 1 (pid: 1657034) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-03_02:59:54 host : ip-26-0-162-233.ec2.internal rank : 4 (local_rank: 4) exitcode : 1 (pid: 1657036) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-03_02:59:54 host : ip-26-0-162-233.ec2.internal rank : 5 (local_rank: 5) exitcode : 1 (pid: 1657037) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [4]: time : 2024-07-03_02:59:54 host : ip-26-0-162-233.ec2.internal rank : 6 (local_rank: 6) exitcode : 1 (pid: 1657038) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [5]: time : 2024-07-03_02:59:54 host : ip-26-0-162-233.ec2.internal rank : 7 (local_rank: 7) exitcode : 1 (pid: 1657039) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_02:59:54 host : ip-26-0-162-233.ec2.internal rank : 0 (local_rank: 0) exitcode : 1 (pid: 1657032) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-162-233: task 0: Exited with exit code 1 W0703 02:59:58.669000 140146853848832 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-165-24.ec2.internal_893718_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 02:59:59.387000 140287186298624 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-163-147.ec2.internal_791043_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 02:59:59.428000 140426170537728 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-246.ec2.internal_319266_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 02:59:59.465000 140248137180928 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-169-247.ec2.internal_320736_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 02:59:59.482000 140156376786688 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-164-207.ec2.internal_404303_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 02:59:59.587000 139959053137664 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-174-36.ec2.internal_833164_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 02:59:59.637000 140565730084608 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-169-139.ec2.internal_202106_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 02:59:59.670000 140253797914432 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 320805 closing signal SIGTERM W0703 02:59:59.670000 140253797914432 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 320806 closing signal SIGTERM W0703 02:59:59.670000 140253797914432 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 320807 closing signal SIGTERM W0703 02:59:59.670000 140253797914432 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 320808 closing signal SIGTERM W0703 02:59:59.672000 140253797914432 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 320809 closing signal SIGTERM W0703 02:59:59.672000 140253797914432 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 320810 closing signal SIGTERM W0703 02:59:59.672000 140253797914432 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 320811 closing signal SIGTERM W0703 02:59:59.674000 140253797914432 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 320812 closing signal SIGTERM W0703 02:59:59.685000 140431831271232 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 319335 closing signal SIGTERM W0703 02:59:59.685000 140431831271232 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 319336 closing signal SIGTERM W0703 02:59:59.685000 140431831271232 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 319337 closing signal SIGTERM W0703 02:59:59.686000 140152514582336 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 893787 closing signal SIGTERM W0703 02:59:59.687000 140152514582336 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 893788 closing signal SIGTERM W0703 02:59:59.687000 140152514582336 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 893789 closing signal SIGTERM W0703 02:59:59.687000 140152514582336 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 893790 closing signal SIGTERM W0703 02:59:59.685000 140431831271232 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 319338 closing signal SIGTERM W0703 02:59:59.687000 139964713871168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 833233 closing signal SIGTERM W0703 02:59:59.687000 139964713871168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 833234 closing signal SIGTERM W0703 02:59:59.686000 140571390818112 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 202174 closing signal SIGTERM W0703 02:59:59.686000 140571390818112 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 202175 closing signal SIGTERM W0703 02:59:59.688000 139964713871168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 833235 closing signal SIGTERM W0703 02:59:59.686000 140571390818112 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 202176 closing signal SIGTERM W0703 02:59:59.688000 139964713871168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 833236 closing signal SIGTERM W0703 02:59:59.688000 140431831271232 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 319339 closing signal SIGTERM W0703 02:59:59.688000 140162037520192 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 404373 closing signal SIGTERM W0703 02:59:59.689000 139964713871168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 833237 closing signal SIGTERM W0703 02:59:59.687000 140571390818112 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 202177 closing signal SIGTERM W0703 02:59:59.689000 140162037520192 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 404374 closing signal SIGTERM W0703 02:59:59.689000 140162037520192 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 404375 closing signal SIGTERM W0703 02:59:59.689000 140162037520192 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 404376 closing signal SIGTERM W0703 02:59:59.688000 140571390818112 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 202178 closing signal SIGTERM W0703 02:59:59.688000 140571390818112 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 202179 closing signal SIGTERM W0703 02:59:59.691000 140152514582336 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 893791 closing signal SIGTERM W0703 02:59:59.691000 140152514582336 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 893792 closing signal SIGTERM W0703 02:59:59.691000 140152514582336 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 893793 closing signal SIGTERM W0703 02:59:59.690000 140431831271232 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 319340 closing signal SIGTERM W0703 02:59:59.690000 140431831271232 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 319341 closing signal SIGTERM W0703 02:59:59.691000 139964713871168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 833238 closing signal SIGTERM W0703 02:59:59.691000 139964713871168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 833239 closing signal SIGTERM W0703 02:59:59.690000 140431831271232 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 319342 closing signal SIGTERM W0703 02:59:59.691000 139964713871168 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 833240 closing signal SIGTERM W0703 02:59:59.691000 140152514582336 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 893794 closing signal SIGTERM W0703 02:59:59.690000 140571390818112 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 202180 closing signal SIGTERM W0703 02:59:59.691000 140571390818112 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 202181 closing signal SIGTERM W0703 02:59:59.693000 140162037520192 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 404377 closing signal SIGTERM W0703 02:59:59.693000 140162037520192 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 404378 closing signal SIGTERM W0703 02:59:59.693000 140162037520192 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 404379 closing signal SIGTERM W0703 02:59:59.694000 140162037520192 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 404380 closing signal SIGTERM E0703 02:59:59.800000 140292847032128 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 791112) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 W0703 02:59:59.806000 140292847032128 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-147.ec2.internal_791043_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 02:59:59.832000 140292847032128 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-147.ec2.internal_791043_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 02:59:59.868000 140292847032128 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-163-147.ec2.internal_791043_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-03_02:59:59 host : ip-26-0-163-147.ec2.internal rank : 9 (local_rank: 1) exitcode : 1 (pid: 791113) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-03_02:59:59 host : ip-26-0-163-147.ec2.internal rank : 10 (local_rank: 2) exitcode : 1 (pid: 791114) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-03_02:59:59 host : ip-26-0-163-147.ec2.internal rank : 11 (local_rank: 3) exitcode : 1 (pid: 791115) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [4]: time : 2024-07-03_02:59:59 host : ip-26-0-163-147.ec2.internal rank : 12 (local_rank: 4) exitcode : 1 (pid: 791116) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [5]: time : 2024-07-03_02:59:59 host : ip-26-0-163-147.ec2.internal rank : 13 (local_rank: 5) exitcode : 1 (pid: 791117) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [6]: time : 2024-07-03_02:59:59 host : ip-26-0-163-147.ec2.internal rank : 14 (local_rank: 6) exitcode : 1 (pid: 791118) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [7]: time : 2024-07-03_02:59:59 host : ip-26-0-163-147.ec2.internal rank : 15 (local_rank: 7) exitcode : 1 (pid: 791119) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-03_02:59:59 host : ip-26-0-163-147.ec2.internal rank : 8 (local_rank: 0) exitcode : 1 (pid: 791112) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-163-147: task 1: Exited with exit code 1 W0703 03:00:03.674000 140146853848832 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-165-24.ec2.internal_893718_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:00:04.432000 140426170537728 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-246.ec2.internal_319266_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:00:04.469000 140248137180928 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-169-247.ec2.internal_320736_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:00:04.487000 140156376786688 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-164-207.ec2.internal_404303_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:00:04.591000 139959053137664 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-174-36.ec2.internal_833164_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:00:04.642000 140565730084608 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-169-139.ec2.internal_202106_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:00:08.678000 140146853848832 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-165-24.ec2.internal_893718_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:00:08.924000 139964713871168 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-174-36.ec2.internal_833164_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:00:08.931000 139964713871168 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-174-36.ec2.internal_833164_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent result = agent.run() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run result = self._invoke_run(role) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run num_nodes_waiting = rdzv_handler.num_nodes_waiting() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting self._state_holder.sync() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync get_response = self._backend.get_state() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state base64_state: bytes = self._call_store("get", self._key) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store raise RendezvousConnectionError( torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details. srun: error: ip-26-0-174-36: task 7: Exited with exit code 1 W0703 03:00:09.436000 140426170537728 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-173-246.ec2.internal_319266_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:00:09.473000 140248137180928 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-169-247.ec2.internal_320736_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:00:09.492000 140156376786688 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-164-207.ec2.internal_404303_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:00:09.646000 140565730084608 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1252] The node 'ip-26-0-169-139.ec2.internal_202106_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:00:12.418000 140431831271232 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-246.ec2.internal_319266_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:00:12.425000 140431831271232 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-173-246.ec2.internal_319266_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent result = agent.run() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run result = self._invoke_run(role) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run num_nodes_waiting = rdzv_handler.num_nodes_waiting() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting self._state_holder.sync() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync get_response = self._backend.get_state() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state base64_state: bytes = self._call_store("get", self._key) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store raise RendezvousConnectionError( torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details. W0703 03:00:12.622000 140253797914432 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-169-247.ec2.internal_320736_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:00:12.629000 140253797914432 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-169-247.ec2.internal_320736_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent result = agent.run() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run result = self._invoke_run(role) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run num_nodes_waiting = rdzv_handler.num_nodes_waiting() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting self._state_holder.sync() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync get_response = self._backend.get_state() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state base64_state: bytes = self._call_store("get", self._key) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store raise RendezvousConnectionError( torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details. srun: error: ip-26-0-173-246: task 6: Exited with exit code 1 W0703 03:00:12.828000 140162037520192 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-164-207.ec2.internal_404303_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:00:12.836000 140162037520192 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-164-207.ec2.internal_404303_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent result = agent.run() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run result = self._invoke_run(role) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run num_nodes_waiting = rdzv_handler.num_nodes_waiting() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting self._state_holder.sync() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync get_response = self._backend.get_state() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state base64_state: bytes = self._call_store("get", self._key) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store raise RendezvousConnectionError( torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details. srun: error: ip-26-0-169-247: task 5: Exited with exit code 1 W0703 03:00:13.024000 140571390818112 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-169-139.ec2.internal_202106_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:00:13.032000 140571390818112 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-169-139.ec2.internal_202106_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent result = agent.run() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run result = self._invoke_run(role) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run num_nodes_waiting = rdzv_handler.num_nodes_waiting() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting self._state_holder.sync() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync get_response = self._backend.get_state() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state base64_state: bytes = self._call_store("get", self._key) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store raise RendezvousConnectionError( torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details. srun: error: ip-26-0-169-139: task 4: Exited with exit code 1 srun: error: ip-26-0-164-207: task 2: Exited with exit code 1 W0703 03:00:13.625000 140152514582336 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-165-24.ec2.internal_893718_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. W0703 03:00:13.632000 140152514582336 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1203] The node 'ip-26-0-165-24.ec2.internal_893718_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store return getattr(self._store, store_op)(*args, **kwargs) torch.distributed.DistNetworkError: Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent result = agent.run() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run result = self._invoke_run(role) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 908, in _invoke_run num_nodes_waiting = rdzv_handler.num_nodes_waiting() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1174, in num_nodes_waiting self._state_holder.sync() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 419, in sync get_response = self._backend.get_state() File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state base64_state: bytes = self._call_store("get", self._key) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store raise RendezvousConnectionError( torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details. srun: error: ip-26-0-165-24: task 3: Exited with exit code 1 Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.