Upload llama-1B/64_GPUS/dp-1_tp-8_pp-8_mbz-4 (commit df31cb6)
========================
START TIME: Sat Jul 6 09:41:00 UTC 2024
python3 version = Python 3.10.14
========================
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token
Login successful
Already on 'bench_cluster'
M examples/config_tiny_llama.py
M examples/config_tiny_llama.yaml
M examples/train_tiny_llama.sh
Your branch is up to date with 'origin/bench_cluster'.
Job status: RUNNING
[2024-07-06 09:41:03,171] torch.distributed.run: [WARNING]
[2024-07-06 09:41:03,171] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:41:03,171] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:41:03,171] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:41:03,172 - 09:41:03,272] torch.distributed.run: [WARNING] (same OMP_NUM_THREADS warning repeated by the launcher on each of the remaining 7 nodes)
[default0]:07/06/2024 09:41:22 [WARNING|DP=0|PP=0|TP=0|ip-26-0-161-221]: [Vocab Size Padding] Padded vocab (size: 50257) with 7 dummy tokens (new size: 50264)
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: Config:
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: Config(general=GeneralArgs(project='bench_cluster',
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: run='%date_%jobid',
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: seed=42,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: step=None,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: consumed_train_samples=None,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: benchmark_csv_path=None,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: ignore_sanity_checks=True),
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: parallelism=ParallelismArgs(dp=1,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: pp=8,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: tp=8,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: pp_engine=<nanotron.parallel.pipeline_parallel.engine.AllForwardAllBackwardPipelineEngine object at 0x7f393da30700>,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: tp_mode=<TensorParallelLinearMode.REDUCE_SCATTER: 2>,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: tp_linear_async_communication=False,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: expert_parallel_size=1),
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: eos_token_id=2,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: hidden_act='silu',
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: hidden_size=2048,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: initializer_range=0.02,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: intermediate_size=4096,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: is_llama_config=True,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: max_position_embeddings=4096,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: num_attention_heads=32,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: num_hidden_layers=24,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: num_key_value_heads=32,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: pad_token_id=None,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: pretraining_tp=1,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: rms_norm_eps=1e-05,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: rope_scaling=None,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: rope_theta=10000.0,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: tie_word_embeddings=True,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: use_cache=True,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: vocab_size=50264),
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: init_method=RandomInit(std=0.025),
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: dtype=torch.bfloat16,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: make_vocab_size_divisible_by=1,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: ddp_bucket_cap_mb=25),
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2',
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: tokenizer_revision=None,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: tokenizer_max_length=None),
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: checkpoints=CheckpointsArgs(checkpoints_path=PosixPath('/dev/null'),
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: checkpoint_interval=100000,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: save_initial_state=False,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: resume_checkpoint_path=None,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: checkpoints_path_is_shared_file_system=False),
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: logging=LoggingArgs(log_level='info',
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: log_level_replica='info',
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: iteration_step_info_interval=1),
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: tokens=TokensArgs(sequence_length=4096,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: train_steps=20,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: micro_batch_size=4,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: batch_accumulation_per_replica=256,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: val_check_interval=-1,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: limit_val_batches=0,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: limit_test_batches=0),
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: adam_beta1=0.9,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: adam_beta2=0.95,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: torch_adam_is_fused=True,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: name='adamW'),
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: zero_stage=1,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: weight_decay=0.01,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: clip_grad=1.0,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: accumulate_grad_in_fp32=True,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: lr_warmup_steps=1,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: lr_warmup_style='linear',
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: lr_decay_style='linear',
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: lr_decay_steps=19,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: lr_decay_starting_step=None,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: min_decay_lr=1e-05)),
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: data_stages=[DatasetStageArgs(name='Training Stage',
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: start_training_step=1,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories',
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: hf_dataset_splits='train',
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: hf_dataset_config_name=None,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: dataset_processing_num_proc_per_process=64,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: dataset_overwrite_cache=False,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: text_column_name='text'),
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: seed=42,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: num_loading_workers=0))],
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: profiler=ProfilerArgs(profiler_export_path=PosixPath('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/64_GPUS/dp-1_tp-8_pp-8_mbz-4')),
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: lighteval=None)
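From the config dump above (dp=1, micro_batch_size=4, batch_accumulation_per_replica=256, sequence_length=4096) the effective batch size per optimizer step follows directly; a minimal sketch of that arithmetic, using only values printed in the log:

```python
# Values taken from the logged TokensArgs / ParallelismArgs above.
dp = 1                  # data-parallel replicas
micro_batch_size = 4    # mbz
grad_accum = 256        # batch_accumulation_per_replica
seq_len = 4096          # sequence_length

# Samples and tokens consumed per optimizer step.
global_batch_size = dp * micro_batch_size * grad_accum
tokens_per_step = global_batch_size * seq_len

print(global_batch_size)  # 1024
print(tokens_per_step)    # 4194304 (~4.2M tokens per step)
```

So over the 20 train_steps of this benchmark, roughly 84M tokens are processed.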
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: Model Config:
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: LlamaConfig(bos_token_id=1,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: eos_token_id=2,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: hidden_act='silu',
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: hidden_size=2048,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: initializer_range=0.02,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: intermediate_size=4096,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: is_llama_config=True,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: max_position_embeddings=4096,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: num_attention_heads=32,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: num_hidden_layers=24,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: num_key_value_heads=32,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: pad_token_id=None,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: pretraining_tp=1,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: rms_norm_eps=1e-05,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: rope_scaling=None,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: rope_theta=10000.0,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: tie_word_embeddings=True,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: use_cache=True,
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: vocab_size=50264)
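A rough parameter count for the LlamaConfig printed above (hidden_size=2048, 24 layers, intermediate_size=4096, padded vocab 50264, full-head KV, no biases). The plain count lands near 1.11G; the log's 1.21G total is consistent with the tied lm_head being counted once more — an assumption here, presumably because embeddings and head live on different pipeline stages:

```python
# Shapes taken from the logged LlamaConfig above.
hidden, layers, inter, vocab = 2048, 24, 4096, 50264

attn = 4 * hidden * hidden      # q, k, v, o projections (num_key_value_heads == num_attention_heads)
mlp = 3 * hidden * inter        # gate, up, down projections (SiLU MLP)
norms = 2 * hidden              # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

embed = vocab * hidden          # token embedding table
total = embed + layers * per_layer + hidden   # + final RMSNorm

# Assumption: the tied lm_head is counted again across pipeline stages,
# which recovers the 1.21G figure logged below.
total_with_head = total + embed

print(f"{total / 1e9:.2f}G")            # 1.11G
print(f"{total_with_head / 1e9:.2f}G")  # 1.21G
```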
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: Building model..
[default0]:07/06/2024 09:41:22 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: Setting PP block ranks...
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=2|ip-26-0-162-180]: Local number of parameters: 15.7M (30.02MiB)
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=2|ip-26-0-162-180]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=2|ip-26-0-162-180]: No checkpoint path provided.
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=2|ip-26-0-166-244]: Local number of parameters: 12.9M (24.55MiB)
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=2|ip-26-0-166-244]: [After model building] Memory usage: 24.56MiB. Peak allocated: 24.58MiB Peak reserved: 28.00MiB
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=2|ip-26-0-166-244]: No checkpoint path provided.
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=7|ip-26-0-166-244]: Local number of parameters: 12.9M (24.55MiB)
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=7|ip-26-0-166-244]: [After model building] Memory usage: 24.56MiB. Peak allocated: 24.58MiB Peak reserved: 28.00MiB
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=7|ip-26-0-166-244]: No checkpoint path provided.
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=0|ip-26-0-162-79]: Local number of parameters: 15.7M (30.02MiB)
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=7|ip-26-0-162-79]: Local number of parameters: 15.7M (30.02MiB)
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=7|ip-26-0-162-79]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=7|ip-26-0-162-79]: No checkpoint path provided.
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=0|ip-26-0-162-79]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=6|ip-26-0-162-79]: Local number of parameters: 15.7M (30.02MiB)
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=7|ip-26-0-162-46]: Local number of parameters: 21M (40.03MiB)
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=7|ip-26-0-162-46]: [After model building] Memory usage: 44.04MiB. Peak allocated: 46.07MiB Peak reserved: 52.00MiB
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=7|ip-26-0-162-46]: No checkpoint path provided.
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=6|ip-26-0-162-79]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=0|ip-26-0-162-79]: No checkpoint path provided.
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=5|ip-26-0-162-46]: Local number of parameters: 21M (40.03MiB)
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=5|ip-26-0-162-46]: [After model building] Memory usage: 44.04MiB. Peak allocated: 46.07MiB Peak reserved: 52.00MiB
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=5|ip-26-0-162-46]: No checkpoint path provided.
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=3|ip-26-0-162-180]: Local number of parameters: 15.7M (30.02MiB)
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=3|ip-26-0-162-180]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=3|ip-26-0-162-180]: No checkpoint path provided.
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=5|ip-26-0-162-79]: Local number of parameters: 15.7M (30.02MiB)
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=5|ip-26-0-162-79]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=1|ip-26-0-162-46]: Local number of parameters: 21M (40.03MiB)
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=1|ip-26-0-162-180]: Local number of parameters: 15.7M (30.02MiB)
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=1|ip-26-0-162-180]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=1|ip-26-0-162-180]: No checkpoint path provided.
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=5|ip-26-0-162-79]: No checkpoint path provided.
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=1|ip-26-0-162-46]: [After model building] Memory usage: 44.04MiB. Peak allocated: 46.07MiB Peak reserved: 52.00MiB
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=4|ip-26-0-162-180]: Local number of parameters: 15.7M (30.02MiB)
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=4|ip-26-0-162-180]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=0|ip-26-0-166-125]: Local number of parameters: 15.7M (30.02MiB)
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=0|ip-26-0-166-125]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=0|ip-26-0-166-125]: No checkpoint path provided.
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=6|ip-26-0-162-79]: No checkpoint path provided.
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=1|ip-26-0-162-46]: No checkpoint path provided.
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=4|ip-26-0-162-46]: Local number of parameters: 21M (40.03MiB)
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=4|ip-26-0-162-180]: No checkpoint path provided.
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=2|ip-26-0-162-79]: Local number of parameters: 15.7M (30.02MiB)
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=2|ip-26-0-162-79]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=2|ip-26-0-162-79]: No checkpoint path provided.
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=4|ip-26-0-162-46]: [After model building] Memory usage: 44.04MiB. Peak allocated: 46.07MiB Peak reserved: 52.00MiB
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=4|ip-26-0-162-46]: No checkpoint path provided.
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=0|ip-26-0-162-180]: Local number of parameters: 15.7M (30.02MiB)
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=0|ip-26-0-162-180]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=0|ip-26-0-162-180]: No checkpoint path provided.
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=1|ip-26-0-162-79]: Local number of parameters: 15.7M (30.02MiB)
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=1|ip-26-0-162-79]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=2|ip-26-0-162-46]: Local number of parameters: 21M (40.03MiB)
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=5|ip-26-0-161-221]: Local number of parameters: 33.9M (64.57MiB)
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: Total number of parameters: 1.21G (2314.22MiB)
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=5|ip-26-0-161-221]: [After model building] Memory usage: 68.59MiB. Peak allocated: 70.62MiB Peak reserved: 78.00MiB
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=7|ip-26-0-162-180]: Local number of parameters: 15.7M (30.02MiB)
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=7|ip-26-0-162-180]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=6|ip-26-0-166-244]: Local number of parameters: 12.9M (24.55MiB)
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=1|ip-26-0-162-14]: Local number of parameters: 15.7M (30.02MiB)
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=1|ip-26-0-162-14]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=1|ip-26-0-162-14]: No checkpoint path provided.
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=1|ip-26-0-162-79]: No checkpoint path provided.
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=2|ip-26-0-162-46]: [After model building] Memory usage: 44.04MiB. Peak allocated: 46.07MiB Peak reserved: 52.00MiB
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=2|ip-26-0-162-46]: No checkpoint path provided.
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: Local number of parameters: 33.9M (64.57MiB)
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: [After model building] Memory usage: 68.59MiB. Peak allocated: 70.62MiB Peak reserved: 78.00MiB
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: No checkpoint path provided.
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: Parametrizing model parameters using StandardParametrizator
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=7|ip-26-0-162-180]: No checkpoint path provided.
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=6|ip-26-0-166-244]: [After model building] Memory usage: 24.56MiB. Peak allocated: 24.58MiB Peak reserved: 28.00MiB
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=7|ip-26-0-162-14]: Local number of parameters: 15.7M (30.02MiB)
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=3|ip-26-0-162-79]: Local number of parameters: 15.7M (30.02MiB)
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=3|ip-26-0-162-79]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=3|ip-26-0-162-79]: No checkpoint path provided.
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=6|ip-26-0-162-46]: Local number of parameters: 21M (40.03MiB)
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=6|ip-26-0-162-46]: [After model building] Memory usage: 44.04MiB. Peak allocated: 46.07MiB Peak reserved: 52.00MiB
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=5|ip-26-0-161-221]: No checkpoint path provided.
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=6|ip-26-0-162-180]: Local number of parameters: 15.7M (30.02MiB)
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=6|ip-26-0-166-244]: No checkpoint path provided.
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=0|ip-26-0-166-244]: Local number of parameters: 12.9M (24.55MiB)
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=7|ip-26-0-162-14]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=4|ip-26-0-162-79]: Local number of parameters: 15.7M (30.02MiB)
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=4|ip-26-0-162-79]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=4|TP=4|ip-26-0-162-79]: No checkpoint path provided.
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=6|ip-26-0-162-46]: No checkpoint path provided.
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=6|ip-26-0-162-180]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=6|ip-26-0-162-180]: No checkpoint path provided.
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=0|ip-26-0-166-244]: [After model building] Memory usage: 24.56MiB. Peak allocated: 24.58MiB Peak reserved: 28.00MiB
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=7|ip-26-0-162-14]: No checkpoint path provided.
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=1|ip-26-0-166-125]: Local number of parameters: 15.7M (30.02MiB)
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=1|ip-26-0-166-125]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=1|ip-26-0-166-125]: No checkpoint path provided.
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=3|ip-26-0-162-46]: Local number of parameters: 21M (40.03MiB)
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=3|ip-26-0-162-46]: [After model building] Memory usage: 44.04MiB. Peak allocated: 46.07MiB Peak reserved: 52.00MiB
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=5|ip-26-0-162-180]: Local number of parameters: 15.7M (30.02MiB)
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=5|ip-26-0-162-180]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=2|TP=5|ip-26-0-162-180]: No checkpoint path provided.
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=0|ip-26-0-166-244]: No checkpoint path provided.
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=5|ip-26-0-162-14]: Local number of parameters: 15.7M (30.02MiB)
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=2|ip-26-0-162-14]: Local number of parameters: 15.7M (30.02MiB)
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=2|ip-26-0-162-14]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=3|ip-26-0-166-125]: Local number of parameters: 15.7M (30.02MiB)
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=3|ip-26-0-162-46]: No checkpoint path provided.
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=5|ip-26-0-166-244]: Local number of parameters: 12.9M (24.55MiB)
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=5|ip-26-0-162-14]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=3|ip-26-0-162-14]: Local number of parameters: 15.7M (30.02MiB)
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=2|ip-26-0-162-14]: No checkpoint path provided.
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=5|ip-26-0-162-14]: No checkpoint path provided.
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=3|ip-26-0-166-125]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=3|ip-26-0-166-125]: No checkpoint path provided.
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=6|ip-26-0-166-214]: Local number of parameters: 21M (40.03MiB)
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=6|ip-26-0-166-214]: [After model building] Memory usage: 44.04MiB. Peak allocated: 46.07MiB Peak reserved: 52.00MiB
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=6|ip-26-0-166-214]: No checkpoint path provided.
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=0|ip-26-0-162-46]: Local number of parameters: 21M (40.03MiB)
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=0|ip-26-0-162-46]: [After model building] Memory usage: 44.04MiB. Peak allocated: 46.07MiB Peak reserved: 52.00MiB
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=3|TP=0|ip-26-0-162-46]: No checkpoint path provided.
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=3|ip-26-0-161-221]: Local number of parameters: 33.9M (64.57MiB)
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=3|ip-26-0-161-221]: [After model building] Memory usage: 68.59MiB. Peak allocated: 70.62MiB Peak reserved: 78.00MiB
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=3|ip-26-0-161-221]: No checkpoint path provided.
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=5|ip-26-0-166-244]: [After model building] Memory usage: 24.56MiB. Peak allocated: 24.58MiB Peak reserved: 28.00MiB
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=3|ip-26-0-162-14]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=6|ip-26-0-166-125]: Local number of parameters: 15.7M (30.02MiB)
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=6|ip-26-0-166-125]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=2|ip-26-0-166-214]: Local number of parameters: 21M (40.03MiB)
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=2|ip-26-0-166-214]: [After model building] Memory usage: 44.04MiB. Peak allocated: 46.07MiB Peak reserved: 52.00MiB
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=1|ip-26-0-161-221]: Local number of parameters: 33.9M (64.57MiB)
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=1|ip-26-0-161-221]: [After model building] Memory usage: 68.59MiB. Peak allocated: 70.62MiB Peak reserved: 78.00MiB
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=5|ip-26-0-166-244]: No checkpoint path provided.
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=3|ip-26-0-162-14]: No checkpoint path provided.
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=6|ip-26-0-166-125]: No checkpoint path provided.
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=2|ip-26-0-166-214]: No checkpoint path provided.
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=1|ip-26-0-161-221]: No checkpoint path provided.
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=1|ip-26-0-166-244]: Local number of parameters: 12.9M (24.55MiB)
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=1|ip-26-0-166-244]: [After model building] Memory usage: 24.56MiB. Peak allocated: 24.58MiB Peak reserved: 28.00MiB
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=0|ip-26-0-162-14]: Local number of parameters: 15.7M (30.02MiB)
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=2|ip-26-0-166-125]: Local number of parameters: 15.7M (30.02MiB)
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=3|ip-26-0-166-214]: Local number of parameters: 21M (40.03MiB)
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=6|ip-26-0-161-221]: Local number of parameters: 33.9M (64.57MiB)
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=1|ip-26-0-166-244]: No checkpoint path provided.
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=0|ip-26-0-162-14]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=0|ip-26-0-162-14]: No checkpoint path provided.
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=2|ip-26-0-166-125]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=3|ip-26-0-166-214]: [After model building] Memory usage: 44.04MiB. Peak allocated: 46.07MiB Peak reserved: 52.00MiB
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=1|ip-26-0-166-214]: Local number of parameters: 21M (40.03MiB)
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=1|ip-26-0-166-214]: [After model building] Memory usage: 44.04MiB. Peak allocated: 46.07MiB Peak reserved: 52.00MiB
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=6|ip-26-0-161-221]: [After model building] Memory usage: 68.59MiB. Peak allocated: 70.62MiB Peak reserved: 78.00MiB
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=6|ip-26-0-161-221]: No checkpoint path provided.
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=4|ip-26-0-166-244]: Local number of parameters: 12.9M (24.55MiB)
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=6|ip-26-0-162-14]: Local number of parameters: 15.7M (30.02MiB)
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=6|ip-26-0-162-14]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=5|ip-26-0-166-125]: Local number of parameters: 15.7M (30.02MiB)
[default1]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=1|ip-26-0-166-214]: No checkpoint path provided.
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=3|ip-26-0-166-214]: No checkpoint path provided.
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=4|ip-26-0-161-221]: Local number of parameters: 33.9M (64.57MiB)
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=4|ip-26-0-166-244]: [After model building] Memory usage: 24.56MiB. Peak allocated: 24.58MiB Peak reserved: 28.00MiB
[default6]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=6|ip-26-0-162-14]: No checkpoint path provided.
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=5|ip-26-0-166-125]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=7|ip-26-0-166-214]: Local number of parameters: 21M (40.03MiB)
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=4|ip-26-0-161-221]: [After model building] Memory usage: 68.59MiB. Peak allocated: 70.62MiB Peak reserved: 78.00MiB
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=4|ip-26-0-161-221]: No checkpoint path provided.
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=3|ip-26-0-166-244]: Local number of parameters: 12.9M (24.55MiB)
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=4|ip-26-0-162-14]: Local number of parameters: 15.7M (30.02MiB)
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=2|ip-26-0-166-125]: No checkpoint path provided.
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=0|ip-26-0-166-214]: Local number of parameters: 21M (40.03MiB)
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=7|ip-26-0-161-221]: Local number of parameters: 33.9M (64.57MiB)
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=7|ip-26-0-161-221]: [After model building] Memory usage: 68.59MiB. Peak allocated: 70.62MiB Peak reserved: 78.00MiB
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=4|ip-26-0-166-244]: No checkpoint path provided.
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=4|ip-26-0-162-14]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=5|ip-26-0-166-125]: No checkpoint path provided.
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=0|ip-26-0-166-214]: [After model building] Memory usage: 44.04MiB. Peak allocated: 46.07MiB Peak reserved: 52.00MiB
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=7|ip-26-0-166-214]: [After model building] Memory usage: 44.04MiB. Peak allocated: 46.07MiB Peak reserved: 52.00MiB
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=7|ip-26-0-166-214]: No checkpoint path provided.
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=7|ip-26-0-161-221]: No checkpoint path provided.
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=3|ip-26-0-166-244]: [After model building] Memory usage: 24.56MiB. Peak allocated: 24.58MiB Peak reserved: 28.00MiB
[default3]:07/06/2024 09:41:40 [INFO|DP=0|PP=7|TP=3|ip-26-0-166-244]: No checkpoint path provided.
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=1|TP=4|ip-26-0-162-14]: No checkpoint path provided.
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=7|ip-26-0-166-125]: Local number of parameters: 15.7M (30.02MiB)
[default0]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=0|ip-26-0-166-214]: No checkpoint path provided.
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=2|ip-26-0-161-221]: Local number of parameters: 33.9M (64.57MiB)
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=2|ip-26-0-161-221]: [After model building] Memory usage: 68.59MiB. Peak allocated: 70.62MiB Peak reserved: 78.00MiB
[default2]:07/06/2024 09:41:40 [INFO|DP=0|PP=0|TP=2|ip-26-0-161-221]: No checkpoint path provided.
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=4|ip-26-0-166-125]: Local number of parameters: 15.7M (30.02MiB)
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=4|ip-26-0-166-214]: Local number of parameters: 21M (40.03MiB)
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=4|ip-26-0-166-214]: [After model building] Memory usage: 44.04MiB. Peak allocated: 46.07MiB Peak reserved: 52.00MiB
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=4|ip-26-0-166-214]: No checkpoint path provided.
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=7|ip-26-0-166-125]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=5|ip-26-0-166-214]: Local number of parameters: 21M (40.03MiB)
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=5|ip-26-0-166-214]: [After model building] Memory usage: 44.04MiB. Peak allocated: 46.07MiB Peak reserved: 52.00MiB
[default5]:07/06/2024 09:41:40 [INFO|DP=0|PP=6|TP=5|ip-26-0-166-214]: No checkpoint path provided.
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=4|ip-26-0-166-125]: [After model building] Memory usage: 33.03MiB. Peak allocated: 35.06MiB Peak reserved: 50.00MiB
[default4]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=4|ip-26-0-166-125]: No checkpoint path provided.
[default7]:07/06/2024 09:41:40 [INFO|DP=0|PP=5|TP=7|ip-26-0-166-125]: No checkpoint path provided.
[default0]:07/06/2024 09:41:41 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: [Optimizer Building] Using LearningRateForSP as learning rate
[default0]:07/06/2024 09:41:41 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: [ZeRO sharding] Size of optimizer params per rank:
[default0]:07/06/2024 09:41:41 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: [ZeRO sharding] DP Rank 0 has 33.9M out of 33.9M (100.00%) params' optimizer states
[default0]:07/06/2024 09:41:42 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples
[default0]:07/06/2024 09:41:42 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: Using `datasets` library
[default0]:07/06/2024 09:41:42 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4')
[default0]:07/06/2024 09:41:42 [WARNING|DP=0|PP=0|TP=0|ip-26-0-161-221]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:41:43 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: [Training Plan] There are 1 training stages
[default0]:07/06/2024 09:41:43 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: [Stage Training Stage] start from step 1
[default0]:07/06/2024 09:41:43 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]:
[default0]:07/06/2024 09:41:43 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: [Start training] datetime: 2024-07-06 09:41:43.986113 | mbs: 4 | grad_accum: 256 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0
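The batch-size numbers in the [Start training] line are mutually consistent; a minimal sketch of the arithmetic (dp=1 comes from this run's dp-1_tp-8_pp-8 topology):

```python
# Sanity check of the [Start training] line above.
mbs = 4            # micro batch size (mbz-4)
grad_accum = 256   # gradient accumulation steps
dp = 1             # data-parallel degree for this run
sequence_length = 4096

global_batch_size = mbs * grad_accum * dp
tokens_per_step = global_batch_size * sequence_length

print(global_batch_size)  # 1024, matching the log
print(tokens_per_step)    # 4194304 tokens per optimizer step
```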
[default6]:07/06/2024 09:41:44 [WARNING|DP=0|PP=4|TP=6|ip-26-0-162-79]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/06/2024 09:41:44 [WARNING|DP=0|PP=7|TP=2|ip-26-0-166-244]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/06/2024 09:41:44 [WARNING|DP=0|PP=7|TP=7|ip-26-0-166-244]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/06/2024 09:41:44 [WARNING|DP=0|PP=3|TP=5|ip-26-0-162-46]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/06/2024 09:41:44 [WARNING|DP=0|PP=3|TP=7|ip-26-0-162-46]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/06/2024 09:41:44 [WARNING|DP=0|PP=3|TP=4|ip-26-0-162-46]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/06/2024 09:41:44 [WARNING|DP=0|PP=2|TP=1|ip-26-0-162-180]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:41:44 [WARNING|DP=0|PP=3|TP=0|ip-26-0-162-46]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/06/2024 09:41:44 [WARNING|DP=0|PP=2|TP=7|ip-26-0-162-180]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/06/2024 09:41:44 [WARNING|DP=0|PP=2|TP=5|ip-26-0-162-180]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/06/2024 09:41:44 [WARNING|DP=0|PP=7|TP=6|ip-26-0-166-244]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/06/2024 09:41:44 [WARNING|DP=0|PP=1|TP=1|ip-26-0-162-14]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/06/2024 09:41:44 [WARNING|DP=0|PP=1|TP=7|ip-26-0-162-14]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/06/2024 09:41:44 [WARNING|DP=0|PP=7|TP=1|ip-26-0-166-244]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/06/2024 09:41:44 [WARNING|DP=0|PP=5|TP=6|ip-26-0-166-125]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/06/2024 09:41:44 [WARNING|DP=0|PP=6|TP=3|ip-26-0-166-214]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/06/2024 09:41:44 [WARNING|DP=0|PP=6|TP=1|ip-26-0-166-214]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/06/2024 09:41:44 [WARNING|DP=0|PP=0|TP=6|ip-26-0-161-221]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/06/2024 09:41:44 [WARNING|DP=0|PP=0|TP=4|ip-26-0-161-221]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/06/2024 09:41:44 [WARNING|DP=0|PP=7|TP=4|ip-26-0-166-244]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/06/2024 09:41:44 [WARNING|DP=0|PP=6|TP=7|ip-26-0-166-214]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/06/2024 09:41:44 [WARNING|DP=0|PP=0|TP=1|ip-26-0-161-221]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/06/2024 09:41:44 [WARNING|DP=0|PP=0|TP=2|ip-26-0-161-221]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/06/2024 09:41:44 [WARNING|DP=0|PP=6|TP=4|ip-26-0-166-214]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/06/2024 09:41:44 [WARNING|DP=0|PP=6|TP=5|ip-26-0-166-214]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/06/2024 09:41:44 [WARNING|DP=0|PP=2|TP=2|ip-26-0-162-180]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/06/2024 09:41:44 [WARNING|DP=0|PP=4|TP=5|ip-26-0-162-79]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/06/2024 09:41:44 [WARNING|DP=0|PP=4|TP=7|ip-26-0-162-79]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:41:44 [WARNING|DP=0|PP=4|TP=0|ip-26-0-162-79]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/06/2024 09:41:44 [WARNING|DP=0|PP=3|TP=6|ip-26-0-162-46]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/06/2024 09:41:44 [WARNING|DP=0|PP=4|TP=1|ip-26-0-162-79]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/06/2024 09:41:44 [WARNING|DP=0|PP=3|TP=1|ip-26-0-162-46]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/06/2024 09:41:44 [WARNING|DP=0|PP=3|TP=2|ip-26-0-162-46]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/06/2024 09:41:44 [WARNING|DP=0|PP=4|TP=3|ip-26-0-162-79]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/06/2024 09:41:44 [WARNING|DP=0|PP=3|TP=3|ip-26-0-162-46]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/06/2024 09:41:44 [WARNING|DP=0|PP=2|TP=3|ip-26-0-162-180]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/06/2024 09:41:44 [WARNING|DP=0|PP=2|TP=4|ip-26-0-162-180]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:41:44 [WARNING|DP=0|PP=2|TP=0|ip-26-0-162-180]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:41:44 [WARNING|DP=0|PP=5|TP=0|ip-26-0-166-125]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/06/2024 09:41:44 [WARNING|DP=0|PP=0|TP=5|ip-26-0-161-221]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/06/2024 09:41:44 [WARNING|DP=0|PP=2|TP=6|ip-26-0-162-180]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:41:44 [WARNING|DP=0|PP=7|TP=0|ip-26-0-166-244]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/06/2024 09:41:44 [WARNING|DP=0|PP=1|TP=3|ip-26-0-162-14]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/06/2024 09:41:44 [WARNING|DP=0|PP=1|TP=5|ip-26-0-162-14]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/06/2024 09:41:44 [WARNING|DP=0|PP=1|TP=2|ip-26-0-162-14]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/06/2024 09:41:44 [WARNING|DP=0|PP=7|TP=5|ip-26-0-166-244]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:41:44 [WARNING|DP=0|PP=1|TP=0|ip-26-0-162-14]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/06/2024 09:41:44 [WARNING|DP=0|PP=5|TP=1|ip-26-0-166-125]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/06/2024 09:41:44 [WARNING|DP=0|PP=6|TP=6|ip-26-0-166-214]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/06/2024 09:41:44 [WARNING|DP=0|PP=5|TP=5|ip-26-0-166-125]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/06/2024 09:41:44 [WARNING|DP=0|PP=6|TP=2|ip-26-0-166-214]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/06/2024 09:41:44 [WARNING|DP=0|PP=0|TP=3|ip-26-0-161-221]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/06/2024 09:41:44 [WARNING|DP=0|PP=7|TP=3|ip-26-0-166-244]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:41:44 [WARNING|DP=0|PP=6|TP=0|ip-26-0-166-214]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/06/2024 09:41:44 [WARNING|DP=0|PP=5|TP=4|ip-26-0-166-125]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/06/2024 09:41:44 [WARNING|DP=0|PP=0|TP=7|ip-26-0-161-221]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/06/2024 09:41:44 [WARNING|DP=0|PP=1|TP=4|ip-26-0-162-14]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/06/2024 09:41:44 [WARNING|DP=0|PP=5|TP=7|ip-26-0-166-125]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/06/2024 09:41:44 [WARNING|DP=0|PP=5|TP=2|ip-26-0-166-125]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/06/2024 09:41:44 [WARNING|DP=0|PP=4|TP=4|ip-26-0-162-79]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/06/2024 09:41:44 [WARNING|DP=0|PP=5|TP=3|ip-26-0-166-125]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/06/2024 09:41:44 [WARNING|DP=0|PP=1|TP=6|ip-26-0-162-14]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/06/2024 09:41:44 [WARNING|DP=0|PP=4|TP=2|ip-26-0-162-79]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:41:50 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps
[default0]:07/06/2024 09:41:50 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-221]: Memory usage: 328.58MiB. Peak allocated 328.59MiB. Peak reserved: 338.00MiB
[default1]:Traceback (most recent call last):
[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default1]: trainer.train(dataloader)
[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train
[default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step
[default1]: outputs = self.pipeline_engine.train_batch_iter(
[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter
[default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default1]: output = model(**micro_batch)
[default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default1]: return self._call_impl(*args, **kwargs)
[default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default1]: return forward_call(*args, **kwargs)
[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward
[default1]: sharded_logits = self.model(
[default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default1]: return self._call_impl(*args, **kwargs)
[default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default1]: return forward_call(*args, **kwargs)
[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default1]: return self._call_impl(*args, **kwargs)
[default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default1]: return forward_call(*args, **kwargs)
[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default1]: output = self.pp_block(**new_kwargs)
[default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default1]: return self._call_impl(*args, **kwargs)
[default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default1]: return forward_call(*args, **kwargs)
[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 636, in forward
[default1]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"]
[default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default1]: return self._call_impl(*args, **kwargs)
[default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default1]: return forward_call(*args, **kwargs)
[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward
[default1]: hidden_states = self.down_proj(self.split_silu_mul(merged_states))
[default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default1]: return self._call_impl(*args, **kwargs)
[default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default1]: return forward_call(*args, **kwargs)
[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 127, in forward
[default1]: return self.act(gate_states) * up_states
[default1]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 1 has a total capacity of 79.33 GiB of which 15.94 MiB is free. Including non-PyTorch memory, this process has 79.30 GiB memory in use. Of the allocated memory 69.01 GiB is allocated by PyTorch, and 1.04 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
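The OOM message above itself suggests an allocator setting; a minimal sketch of applying it (the variable must be set before CUDA is first initialized, e.g. in the launch environment or at the very top of the training script; whether it avoids this particular OOM is workload-dependent):

```python
# Allocator hint from the OOM message: enable expandable segments to reduce
# fragmentation-related failures. Must run before any CUDA allocation.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])  # expandable_segments:True
```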
[default2]:Traceback (most recent call last):
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default2]: trainer.train(dataloader)
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train
[default2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step
[default2]: outputs = self.pipeline_engine.train_batch_iter(
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter
[default2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default2]: output = model(**micro_batch)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default4]:Traceback (most recent call last):
[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default4]: trainer.train(dataloader)
[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train
[default4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step
[default4]: outputs = self.pipeline_engine.train_batch_iter(
[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter
[default4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default4]: output = model(**micro_batch)
[default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default4]: return self._call_impl(*args, **kwargs)
[default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default4]: return forward_call(*args, **kwargs)
[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward
[default4]: sharded_logits = self.model(
[default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default4]: return self._call_impl(*args, **kwargs)
[default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default4]: return forward_call(*args, **kwargs)
[default5]:Traceback (most recent call last):
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default5]: trainer.train(dataloader)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train
[default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step
[default5]: outputs = self.pipeline_engine.train_batch_iter(
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter
[default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default5]: output = model(**micro_batch)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: return self._call_impl(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]: return forward_call(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward
[default5]: sharded_logits = self.model(
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: return self._call_impl(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default2]: return self._call_impl(*args, **kwargs)
[default4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default2]: return forward_call(*args, **kwargs)
[default5]: return forward_call(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default2]: sharded_logits = self.model(
[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default4]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default2]: return self._call_impl(*args, **kwargs)
[default4]: return self._call_impl(*args, **kwargs)
[default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: return self._call_impl(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default4]: return forward_call(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default2]: return forward_call(*args, **kwargs)
[default5]: return forward_call(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default4]: output = self.pp_block(**new_kwargs)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default5]: output = self.pp_block(**new_kwargs)
[default4]: return self._call_impl(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default2]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default4]: return forward_call(*args, **kwargs)
[default5]: return self._call_impl(*args, **kwargs)
[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 636, in forward
[default4]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"]
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default2]: return self._call_impl(*args, **kwargs)
[default4]: return self._call_impl(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default2]: return forward_call(*args, **kwargs)
[default5]: return forward_call(*args, **kwargs)
[default4]: return forward_call(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 636, in forward
[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward
[default2]: output = self.pp_block(**new_kwargs)
[default5]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"]
[default4]: hidden_states = self.down_proj(self.split_silu_mul(merged_states))
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default2]: return self._call_impl(*args, **kwargs)
[default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default4]: return self._call_impl(*args, **kwargs)
[default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]: return self._call_impl(*args, **kwargs)
[default2]: return forward_call(*args, **kwargs)
[default4]: return forward_call(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 630, in forward
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default2]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 127, in forward
[default4]: return self.act(gate_states) * up_states
[default2]: return self._call_impl(*args, **kwargs)
[default5]: return forward_call(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default2]: return forward_call(*args, **kwargs)
[default4]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 4 has a total capacity of 79.33 GiB of which 1.94 MiB is free. Including non-PyTorch memory, this process has 79.31 GiB memory in use. Of the allocated memory 68.75 GiB is allocated by PyTorch, and 1.05 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 597, in forward
[default2]: output = self.o_proj(attention_output)
[default5]: hidden_states = self.down_proj(self.split_silu_mul(merged_states))
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default2]: return self._call_impl(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default2]: return forward_call(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward
[default2]: return row_linear(
[default5]: return self._call_impl(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear
[default5]: return forward_call(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 127, in forward
[default2]: out = F.linear(input, weight, bias)
[default2]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 2 has a total capacity of 79.33 GiB of which 1.94 MiB is free. Including non-PyTorch memory, this process has 79.31 GiB memory in use. Of the allocated memory 68.91 GiB is allocated by PyTorch, and 1.05 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default5]: return self.act(gate_states) * up_states
[default5]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 5 has a total capacity of 79.33 GiB of which 1.94 MiB is free. Including non-PyTorch memory, this process has 79.31 GiB memory in use. Of the allocated memory 68.75 GiB is allocated by PyTorch, and 1.05 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default6]:Traceback (most recent call last):
[default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default6]: trainer.train(dataloader)
[default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train
[default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step
[default6]: outputs = self.pipeline_engine.train_batch_iter(
[default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter
[default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default6]: output = model(**micro_batch)
[default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default6]: return self._call_impl(*args, **kwargs)
[default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default6]: return forward_call(*args, **kwargs)
[default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward
[default6]: sharded_logits = self.model(
[default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default6]: return self._call_impl(*args, **kwargs)
[default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default6]: return forward_call(*args, **kwargs)
[default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default6]: return self._call_impl(*args, **kwargs)
[default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default6]: return forward_call(*args, **kwargs)
[default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default6]: output = self.pp_block(**new_kwargs)
[default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default6]: return self._call_impl(*args, **kwargs)
[default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default6]: return forward_call(*args, **kwargs)
[default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 636, in forward
[default6]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"]
[default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default6]: return self._call_impl(*args, **kwargs)
[default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default6]: return forward_call(*args, **kwargs)
[default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward
[default6]: hidden_states = self.down_proj(self.split_silu_mul(merged_states))
[default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default6]: return self._call_impl(*args, **kwargs)
[default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default6]: return forward_call(*args, **kwargs)
[default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 127, in forward
[default6]: return self.act(gate_states) * up_states
[default6]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 6 has a total capacity of 79.33 GiB of which 1.94 MiB is free. Including non-PyTorch memory, this process has 79.31 GiB memory in use. Of the allocated memory 68.75 GiB is allocated by PyTorch, and 1.05 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default0]:STAGE:2024-07-06 09:42:10 191688:191688 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[default7]:Traceback (most recent call last):
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default7]: trainer.train(dataloader)
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train
[default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step
[default7]: outputs = self.pipeline_engine.train_batch_iter(
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter
[default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default7]: output = model(**micro_batch)
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default7]: return self._call_impl(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default7]: return forward_call(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward
[default7]: sharded_logits = self.model(
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default7]: return self._call_impl(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default7]: return forward_call(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default7]: return self._call_impl(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default7]: return forward_call(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default7]: output = self.pp_block(**new_kwargs)
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default7]: return self._call_impl(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default7]: return forward_call(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 636, in forward
[default7]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"]
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default7]: return self._call_impl(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default7]: return forward_call(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward
[default7]: hidden_states = self.down_proj(self.split_silu_mul(merged_states))
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default7]: return self._call_impl(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default7]: return forward_call(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward
[default7]: return row_linear(
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear
[default7]: out = F.linear(input, weight, bias)
[default7]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 7 has a total capacity of 79.33 GiB of which 17.94 MiB is free. Including non-PyTorch memory, this process has 79.30 GiB memory in use. Of the allocated memory 69.03 GiB is allocated by PyTorch, and 1.04 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
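Editor's note: every rank above dies with the same remedy suggestion in the OOM message. A minimal sketch of applying it, assuming the run is relaunched via `torchrun` with the same `run_train.py` entry point; the environment variable name and value are taken verbatim from the error text, everything else here is illustrative:

```shell
# Enable PyTorch's expandable-segments allocator, as suggested by the OOM
# messages, to reduce fragmentation of reserved-but-unallocated memory.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Confirm the setting is visible to child processes before relaunching.
echo "$PYTORCH_CUDA_ALLOC_CONF"
```

Note this only helps when fragmentation is the problem; here roughly 69 GiB is genuinely allocated on an 80 GiB card, so reducing the micro-batch size or activation footprint may be needed instead.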
[2024-07-06 09:42:15,488] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191688 closing signal SIGTERM
[2024-07-06 09:42:15,488] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191691 closing signal SIGTERM
[2024-07-06 09:42:17,255] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 191689) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-07-06_09:42:15
host : ip-26-0-161-221.ec2.internal
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 191690)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-07-06_09:42:15
host : ip-26-0-161-221.ec2.internal
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 191692)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-07-06_09:42:15
host : ip-26-0-161-221.ec2.internal
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 191693)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2024-07-06_09:42:15
host : ip-26-0-161-221.ec2.internal
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 191694)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2024-07-06_09:42:15
host : ip-26-0-161-221.ec2.internal
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 191695)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-06_09:42:15
host : ip-26-0-161-221.ec2.internal
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 191689)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: ip-26-0-161-221: task 0: Exited with exit code 1
[2024-07-06 09:42:19,538] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-166-214.ec2.internal_195060_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:42:19,578] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-162-180.ec2.internal_234975_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:42:20,279] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-166-244.ec2.internal_4131861_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:42:20,376] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-162-46.ec2.internal_1309173_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:42:20,378] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-166-125.ec2.internal_174414_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:42:20,435] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-162-79.ec2.internal_1048186_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:42:20,449] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-162-14.ec2.internal_1204653_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:42:20,487] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1048257 closing signal SIGTERM
[2024-07-06 09:42:20,487] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1048258 closing signal SIGTERM
[2024-07-06 09:42:20,488] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1048259 closing signal SIGTERM
[2024-07-06 09:42:20,490] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1048260 closing signal SIGTERM
[2024-07-06 09:42:20,489] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4131932 closing signal SIGTERM
[2024-07-06 09:42:20,490] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1048261 closing signal SIGTERM
[2024-07-06 09:42:20,490] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4131933 closing signal SIGTERM
[2024-07-06 09:42:20,491] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4131934 closing signal SIGTERM
[2024-07-06 09:42:20,491] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1048262 closing signal SIGTERM
[2024-07-06 09:42:20,490] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 174485 closing signal SIGTERM
[2024-07-06 09:42:20,491] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1309243 closing signal SIGTERM
[2024-07-06 09:42:20,491] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4131935 closing signal SIGTERM
[2024-07-06 09:42:20,491] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1309244 closing signal SIGTERM
[2024-07-06 09:42:20,492] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1048263 closing signal SIGTERM
[2024-07-06 09:42:20,492] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1048264 closing signal SIGTERM
[2024-07-06 09:42:20,492] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4131936 closing signal SIGTERM
[2024-07-06 09:42:20,491] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 174486 closing signal SIGTERM
[2024-07-06 09:42:20,491] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 174487 closing signal SIGTERM
[2024-07-06 09:42:20,491] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 174488 closing signal SIGTERM
[2024-07-06 09:42:20,493] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1204725 closing signal SIGTERM
[2024-07-06 09:42:20,493] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1309245 closing signal SIGTERM
[2024-07-06 09:42:20,493] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1204726 closing signal SIGTERM
[2024-07-06 09:42:20,493] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 174489 closing signal SIGTERM
[2024-07-06 09:42:20,493] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 174490 closing signal SIGTERM
[2024-07-06 09:42:20,494] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1309246 closing signal SIGTERM
[2024-07-06 09:42:20,494] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 174491 closing signal SIGTERM
[2024-07-06 09:42:20,494] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1204727 closing signal SIGTERM
[2024-07-06 09:42:20,494] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 174492 closing signal SIGTERM
[2024-07-06 09:42:20,496] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1309247 closing signal SIGTERM
[2024-07-06 09:42:20,496] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1204728 closing signal SIGTERM
[2024-07-06 09:42:20,496] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4131937 closing signal SIGTERM
[2024-07-06 09:42:20,497] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1204729 closing signal SIGTERM
[2024-07-06 09:42:20,497] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1204730 closing signal SIGTERM
[2024-07-06 09:42:20,496] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1309248 closing signal SIGTERM
[2024-07-06 09:42:20,496] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1309249 closing signal SIGTERM
[2024-07-06 09:42:20,497] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4131938 closing signal SIGTERM
[2024-07-06 09:42:20,497] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1204731 closing signal SIGTERM
[2024-07-06 09:42:20,498] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1204732 closing signal SIGTERM
[2024-07-06 09:42:20,500] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 195129 closing signal SIGTERM
[2024-07-06 09:42:20,500] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 195130 closing signal SIGTERM
[2024-07-06 09:42:20,498] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1309250 closing signal SIGTERM
[2024-07-06 09:42:20,499] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4131939 closing signal SIGTERM
[2024-07-06 09:42:20,499] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 235045 closing signal SIGTERM
[2024-07-06 09:42:20,499] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 235046 closing signal SIGTERM
[2024-07-06 09:42:20,501] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 195131 closing signal SIGTERM
[2024-07-06 09:42:20,500] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 235047 closing signal SIGTERM
[2024-07-06 09:42:20,502] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 195132 closing signal SIGTERM
[2024-07-06 09:42:20,503] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 195133 closing signal SIGTERM
[2024-07-06 09:42:20,503] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 195134 closing signal SIGTERM
[2024-07-06 09:42:20,503] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 235048 closing signal SIGTERM
[2024-07-06 09:42:20,506] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 195135 closing signal SIGTERM
[2024-07-06 09:42:20,504] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 235049 closing signal SIGTERM
[2024-07-06 09:42:20,506] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 195136 closing signal SIGTERM
[2024-07-06 09:42:20,505] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 235050 closing signal SIGTERM
[2024-07-06 09:42:20,507] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 235051 closing signal SIGTERM
[2024-07-06 09:42:20,507] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 235052 closing signal SIGTERM
[2024-07-06 09:42:23,844] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-166-244.ec2.internal_4131861_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
torch.distributed.DistNetworkError: Broken pipe
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
result = agent.run()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
result = self._invoke_run(role)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 900, in _invoke_run
num_nodes_waiting = rdzv_handler.num_nodes_waiting()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1083, in num_nodes_waiting
self._state_holder.sync()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 409, in sync
get_response = self._backend.get_state()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
srun: error: ip-26-0-166-244: task 7: Exited with exit code 1
[2024-07-06 09:42:24,542] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-166-214.ec2.internal_195060_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:42:24,583] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-162-180.ec2.internal_234975_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:42:25,381] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-162-46.ec2.internal_1309173_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:42:25,383] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-166-125.ec2.internal_174414_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:42:25,434] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-166-125.ec2.internal_174414_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
torch.distributed.DistNetworkError: Broken pipe
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
result = agent.run()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
result = self._invoke_run(role)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 900, in _invoke_run
num_nodes_waiting = rdzv_handler.num_nodes_waiting()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1083, in num_nodes_waiting
self._state_holder.sync()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 409, in sync
get_response = self._backend.get_state()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
[2024-07-06 09:42:25,435] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-162-14.ec2.internal_1204653_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
torch.distributed.DistNetworkError: Broken pipe
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
result = agent.run()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
result = self._invoke_run(role)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 900, in _invoke_run
num_nodes_waiting = rdzv_handler.num_nodes_waiting()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1083, in num_nodes_waiting
self._state_holder.sync()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 409, in sync
get_response = self._backend.get_state()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
[2024-07-06 09:42:25,440] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-162-79.ec2.internal_1048186_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:42:25,529] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-162-180.ec2.internal_234975_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
torch.distributed.DistNetworkError: Broken pipe
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
result = agent.run()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
result = self._invoke_run(role)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 900, in _invoke_run
num_nodes_waiting = rdzv_handler.num_nodes_waiting()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1083, in num_nodes_waiting
self._state_holder.sync()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 409, in sync
get_response = self._backend.get_state()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
srun: error: ip-26-0-162-14: task 1: Exited with exit code 1
srun: error: ip-26-0-166-125: task 5: Exited with exit code 1
srun: error: ip-26-0-162-180: task 4: Exited with exit code 1
[2024-07-06 09:42:25,929] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-162-79.ec2.internal_1048186_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
torch.distributed.DistNetworkError: Broken pipe
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
result = agent.run()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
result = self._invoke_run(role)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 900, in _invoke_run
num_nodes_waiting = rdzv_handler.num_nodes_waiting()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1083, in num_nodes_waiting
self._state_holder.sync()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 409, in sync
get_response = self._backend.get_state()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
[2024-07-06 09:42:26,039] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-166-214.ec2.internal_195060_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
torch.distributed.DistNetworkError: Broken pipe
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
result = agent.run()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
result = self._invoke_run(role)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 900, in _invoke_run
num_nodes_waiting = rdzv_handler.num_nodes_waiting()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1083, in num_nodes_waiting
self._state_holder.sync()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 409, in sync
get_response = self._backend.get_state()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
srun: error: ip-26-0-162-79: task 3: Exited with exit code 1
[2024-07-06 09:42:26,329] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-162-46.ec2.internal_1309173_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
torch.distributed.DistNetworkError: Broken pipe
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
result = agent.run()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
result = self._invoke_run(role)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 900, in _invoke_run
num_nodes_waiting = rdzv_handler.num_nodes_waiting()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1083, in num_nodes_waiting
self._state_holder.sync()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 409, in sync
get_response = self._backend.get_state()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
srun: error: ip-26-0-166-214: task 6: Exited with exit code 1
srun: error: ip-26-0-162-46: task 2: Exited with exit code 1
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.