========================
START TIME: Sat Jul 6 09:33:59 UTC 2024
python3 version = Python 3.10.14
========================
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token
Login successful
Already on 'bench_cluster'
M examples/config_tiny_llama.py
M examples/config_tiny_llama.yaml
M examples/train_tiny_llama.sh
Your branch is up to date with 'origin/bench_cluster'.
Job status: RUNNING
[2024-07-06 09:34:01,707] torch.distributed.run: [WARNING]
[2024-07-06 09:34:01,707] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:01,707] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:34:01,707] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:01,715] torch.distributed.run: [WARNING]
[2024-07-06 09:34:01,715] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:01,715] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:34:01,715] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:01,721] torch.distributed.run: [WARNING]
[2024-07-06 09:34:01,721] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:01,721] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:34:01,721] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:01,724] torch.distributed.run: [WARNING]
[2024-07-06 09:34:01,724] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:01,724] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:34:01,724] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:01,740] torch.distributed.run: [WARNING]
[2024-07-06 09:34:01,740] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:01,740] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:34:01,740] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:01,738] torch.distributed.run: [WARNING]
[2024-07-06 09:34:01,738] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:01,738] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:34:01,738] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:01,774] torch.distributed.run: [WARNING]
[2024-07-06 09:34:01,774] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:01,774] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:34:01,774] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:01,777] torch.distributed.run: [WARNING]
[2024-07-06 09:34:01,777] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:01,777] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:34:01,777] torch.distributed.run: [WARNING] *****************************************
[default0]:07/06/2024 09:34:20 [WARNING|DP=0|PP=0|TP=0|ip-26-0-160-192]: [Vocab Size Padding] Padded vocab (size: 50257) with 1 dummy tokens (new size: 50258)
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Config:
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Config(general=GeneralArgs(project='bench_cluster',
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: run='%date_%jobid',
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: seed=42,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: step=None,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: consumed_train_samples=None,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: benchmark_csv_path=None,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: ignore_sanity_checks=True),
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: parallelism=ParallelismArgs(dp=16,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: pp=2,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: tp=2,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: pp_engine=,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: tp_mode=,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: tp_linear_async_communication=False,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: expert_parallel_size=1),
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: eos_token_id=2,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: hidden_act='silu',
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: hidden_size=2048,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: initializer_range=0.02,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: intermediate_size=4096,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: is_llama_config=True,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: max_position_embeddings=4096,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: num_attention_heads=32,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: num_hidden_layers=24,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: num_key_value_heads=32,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: pad_token_id=None,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: pretraining_tp=1,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: rms_norm_eps=1e-05,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: rope_scaling=None,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: rope_theta=10000.0,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: tie_word_embeddings=True,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: use_cache=True,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: vocab_size=50258),
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: init_method=RandomInit(std=0.025),
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: dtype=torch.bfloat16,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: make_vocab_size_divisible_by=1,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: ddp_bucket_cap_mb=25),
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2',
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: tokenizer_revision=None,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: tokenizer_max_length=None),
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: checkpoints=CheckpointsArgs(checkpoints_path=PosixPath('/dev/null'),
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: checkpoint_interval=100000,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: save_initial_state=False,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: resume_checkpoint_path=None,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: checkpoints_path_is_shared_file_system=False),
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: logging=LoggingArgs(log_level='info',
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: log_level_replica='info',
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: iteration_step_info_interval=1),
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: tokens=TokensArgs(sequence_length=4096,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: train_steps=20,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: micro_batch_size=2,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: batch_accumulation_per_replica=32,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: val_check_interval=-1,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: limit_val_batches=0,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: limit_test_batches=0),
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: adam_beta1=0.9,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: adam_beta2=0.95,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: torch_adam_is_fused=True,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: name='adamW'),
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: zero_stage=1,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: weight_decay=0.01,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: clip_grad=1.0,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: accumulate_grad_in_fp32=True,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: lr_warmup_steps=1,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: lr_warmup_style='linear',
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: lr_decay_style='linear',
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: lr_decay_steps=19,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: lr_decay_starting_step=None,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: min_decay_lr=1e-05)),
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: data_stages=[DatasetStageArgs(name='Training Stage',
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: start_training_step=1,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories',
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: hf_dataset_splits='train',
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: hf_dataset_config_name=None,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: dataset_processing_num_proc_per_process=64,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: dataset_overwrite_cache=False,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: text_column_name='text'),
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: seed=42,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: num_loading_workers=0))],
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: profiler=ProfilerArgs(profiler_export_path=PosixPath('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/64_GPUS/dp-16_tp-2_pp-2_mbz-2')),
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: lighteval=None)
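Note: with dp=16, tp=2 and pp=2 this config occupies 16 x 2 x 2 = 64 GPUs (matching the 64_GPUS results path), and each optimizer step covers micro_batch_size x batch_accumulation_per_replica x dp = 2 x 32 x 16 = 1024 sequences of 4096 tokens. A minimal sanity-check sketch (plain Python, values taken only from the config above):

    # Back-of-the-envelope check of the parallel layout and batch size above.
    dp, tp, pp = 16, 2, 2
    mbs, grad_accum, seq_len = 2, 32, 4096

    world_size = dp * tp * pp                      # 64 GPUs in total
    global_batch_size = mbs * grad_accum * dp      # 1024 sequences per optimizer step
    tokens_per_step = global_batch_size * seq_len  # 4,194,304 tokens per step

    print(world_size, global_batch_size, tokens_per_step)  # 64 1024 4194304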
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Model Config:
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: LlamaConfig(bos_token_id=1,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: eos_token_id=2,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: hidden_act='silu',
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: hidden_size=2048,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: initializer_range=0.02,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: intermediate_size=4096,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: is_llama_config=True,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: max_position_embeddings=4096,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: num_attention_heads=32,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: num_hidden_layers=24,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: num_key_value_heads=32,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: pad_token_id=None,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: pretraining_tp=1,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: rms_norm_eps=1e-05,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: rope_scaling=None,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: rope_theta=10000.0,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: tie_word_embeddings=True,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: use_cache=True,
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: vocab_size=50258)
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Building model..
[default0]:07/06/2024 09:34:20 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Setting PP block ranks...
[default0]:07/06/2024 09:34:32 [INFO|DP=12|PP=0|TP=0|ip-26-0-163-134]: No checkpoint path provided.
[default0]:07/06/2024 09:34:32 [INFO|DP=12|PP=1|TP=0|ip-26-0-174-36]: No checkpoint path provided.
[default1]:07/06/2024 09:34:32 [INFO|DP=12|PP=0|TP=1|ip-26-0-163-134]: No checkpoint path provided.
[default1]:07/06/2024 09:34:32 [INFO|DP=12|PP=1|TP=1|ip-26-0-174-36]: No checkpoint path provided.
[default3]:07/06/2024 09:34:32 [INFO|DP=13|PP=1|TP=1|ip-26-0-174-36]: No checkpoint path provided.
[default3]:07/06/2024 09:34:32 [INFO|DP=13|PP=0|TP=1|ip-26-0-163-134]: No checkpoint path provided.
[default2]:07/06/2024 09:34:32 [INFO|DP=13|PP=0|TP=0|ip-26-0-163-134]: No checkpoint path provided.
[default2]:07/06/2024 09:34:32 [INFO|DP=13|PP=1|TP=0|ip-26-0-174-36]: No checkpoint path provided.
[default0]:07/06/2024 09:34:32 [INFO|DP=0|PP=1|TP=0|ip-26-0-165-131]: Local number of parameters: 261M (498.24MiB)
[default0]:07/06/2024 09:34:32 [INFO|DP=0|PP=1|TP=0|ip-26-0-165-131]: [After model building] Memory usage: 508.26MiB. Peak allocated: 510.29MiB Peak reserved: 526.00MiB
[default0]:07/06/2024 09:34:32 [INFO|DP=0|PP=1|TP=0|ip-26-0-165-131]: No checkpoint path provided.
[default1]:07/06/2024 09:34:32 [INFO|DP=0|PP=1|TP=1|ip-26-0-165-131]: Local number of parameters: 261M (498.24MiB)
[default1]:07/06/2024 09:34:32 [INFO|DP=0|PP=1|TP=1|ip-26-0-165-131]: [After model building] Memory usage: 508.26MiB. Peak allocated: 510.29MiB Peak reserved: 526.00MiB
[default1]:07/06/2024 09:34:32 [INFO|DP=0|PP=1|TP=1|ip-26-0-165-131]: No checkpoint path provided.
[default2]:07/06/2024 09:34:32 [INFO|DP=9|PP=0|TP=0|ip-26-0-161-78]: No checkpoint path provided.
[default3]:07/06/2024 09:34:32 [INFO|DP=9|PP=0|TP=1|ip-26-0-161-78]: No checkpoint path provided.
[default0]:07/06/2024 09:34:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Total number of parameters: 1.21G (2313.02MiB)
[default0]:07/06/2024 09:34:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Local number of parameters: 345M (658.27MiB)
[default0]:07/06/2024 09:34:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [After model building] Memory usage: 672.29MiB. Peak allocated: 674.32MiB Peak reserved: 690.00MiB
[default0]:07/06/2024 09:34:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: No checkpoint path provided.
[default0]:07/06/2024 09:34:32 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Parametrizing model parameters using StandardParametrizator
[default2]:07/06/2024 09:34:32 [INFO|DP=9|PP=1|TP=0|ip-26-0-173-246]: No checkpoint path provided.
[default1]:07/06/2024 09:34:32 [INFO|DP=0|PP=0|TP=1|ip-26-0-160-192]: Local number of parameters: 345M (658.27MiB)
[default3]:07/06/2024 09:34:32 [INFO|DP=9|PP=1|TP=1|ip-26-0-173-246]: No checkpoint path provided.
[default1]:07/06/2024 09:34:32 [INFO|DP=0|PP=0|TP=1|ip-26-0-160-192]: [After model building] Memory usage: 672.29MiB. Peak allocated: 674.32MiB Peak reserved: 690.00MiB
[default1]:07/06/2024 09:34:32 [INFO|DP=0|PP=0|TP=1|ip-26-0-160-192]: No checkpoint path provided.
[default3]:07/06/2024 09:34:32 [INFO|DP=1|PP=1|TP=1|ip-26-0-165-131]: No checkpoint path provided.
[default1]:07/06/2024 09:34:32 [INFO|DP=8|PP=0|TP=1|ip-26-0-161-78]: No checkpoint path provided.
[default2]:07/06/2024 09:34:32 [INFO|DP=1|PP=1|TP=0|ip-26-0-165-131]: No checkpoint path provided.
[default0]:07/06/2024 09:34:32 [INFO|DP=8|PP=0|TP=0|ip-26-0-161-78]: No checkpoint path provided.
[default1]:07/06/2024 09:34:32 [INFO|DP=8|PP=1|TP=1|ip-26-0-173-246]: No checkpoint path provided.
[default4]:07/06/2024 09:34:32 [INFO|DP=14|PP=0|TP=0|ip-26-0-163-134]: No checkpoint path provided.
[default0]:07/06/2024 09:34:32 [INFO|DP=8|PP=1|TP=0|ip-26-0-173-246]: No checkpoint path provided.
[default6]:07/06/2024 09:34:32 [INFO|DP=15|PP=0|TP=0|ip-26-0-163-134]: No checkpoint path provided.
[default7]:07/06/2024 09:34:32 [INFO|DP=15|PP=0|TP=1|ip-26-0-163-134]: No checkpoint path provided.
[default2]:07/06/2024 09:34:32 [INFO|DP=1|PP=0|TP=0|ip-26-0-160-192]: No checkpoint path provided.
[default3]:07/06/2024 09:34:32 [INFO|DP=1|PP=0|TP=1|ip-26-0-160-192]: No checkpoint path provided.
[default5]:07/06/2024 09:34:32 [INFO|DP=14|PP=0|TP=1|ip-26-0-163-134]: No checkpoint path provided.
[default5]:07/06/2024 09:34:32 [INFO|DP=14|PP=1|TP=1|ip-26-0-174-36]: No checkpoint path provided.
[default7]:07/06/2024 09:34:32 [INFO|DP=15|PP=1|TP=1|ip-26-0-174-36]: No checkpoint path provided.
[default4]:07/06/2024 09:34:32 [INFO|DP=14|PP=1|TP=0|ip-26-0-174-36]: No checkpoint path provided.
[default6]:07/06/2024 09:34:32 [INFO|DP=15|PP=1|TP=0|ip-26-0-174-36]: No checkpoint path provided.
[default7]:07/06/2024 09:34:33 [INFO|DP=3|PP=1|TP=1|ip-26-0-165-131]: No checkpoint path provided.
[default6]:07/06/2024 09:34:33 [INFO|DP=3|PP=1|TP=0|ip-26-0-165-131]: No checkpoint path provided.
[default5]:07/06/2024 09:34:33 [INFO|DP=2|PP=1|TP=1|ip-26-0-165-131]: No checkpoint path provided.
[default4]:07/06/2024 09:34:33 [INFO|DP=2|PP=1|TP=0|ip-26-0-165-131]: No checkpoint path provided.
[default4]:07/06/2024 09:34:33 [INFO|DP=10|PP=0|TP=0|ip-26-0-161-78]: No checkpoint path provided.
[default6]:07/06/2024 09:34:32 [INFO|DP=11|PP=0|TP=0|ip-26-0-161-78]: No checkpoint path provided.
[default7]:07/06/2024 09:34:33 [INFO|DP=3|PP=0|TP=1|ip-26-0-160-192]: No checkpoint path provided.
[default5]:07/06/2024 09:34:33 [INFO|DP=10|PP=0|TP=1|ip-26-0-161-78]: No checkpoint path provided.
[default6]:07/06/2024 09:34:32 [INFO|DP=11|PP=1|TP=0|ip-26-0-173-246]: No checkpoint path provided.
[default7]:07/06/2024 09:34:32 [INFO|DP=11|PP=0|TP=1|ip-26-0-161-78]: No checkpoint path provided.
[default5]:07/06/2024 09:34:33 [INFO|DP=2|PP=0|TP=1|ip-26-0-160-192]: No checkpoint path provided.
[default4]:07/06/2024 09:34:33 [INFO|DP=2|PP=0|TP=0|ip-26-0-160-192]: No checkpoint path provided.
[default4]:07/06/2024 09:34:33 [INFO|DP=10|PP=1|TP=0|ip-26-0-173-246]: No checkpoint path provided.
[default0]:07/06/2024 09:34:32 [INFO|DP=4|PP=1|TP=0|ip-26-0-165-59]: No checkpoint path provided.
[default0]:07/06/2024 09:34:32 [INFO|DP=4|PP=0|TP=0|ip-26-0-161-142]: No checkpoint path provided.
[default5]:07/06/2024 09:34:33 [INFO|DP=10|PP=1|TP=1|ip-26-0-173-246]: No checkpoint path provided.
[default1]:07/06/2024 09:34:32 [INFO|DP=4|PP=0|TP=1|ip-26-0-161-142]: No checkpoint path provided.
[default6]:07/06/2024 09:34:33 [INFO|DP=3|PP=0|TP=0|ip-26-0-160-192]: No checkpoint path provided.
[default1]:07/06/2024 09:34:32 [INFO|DP=4|PP=1|TP=1|ip-26-0-165-59]: No checkpoint path provided.
[default7]:07/06/2024 09:34:32 [INFO|DP=11|PP=1|TP=1|ip-26-0-173-246]: No checkpoint path provided.
[default3]:07/06/2024 09:34:33 [INFO|DP=5|PP=0|TP=1|ip-26-0-161-142]: No checkpoint path provided.
[default2]:07/06/2024 09:34:33 [INFO|DP=5|PP=0|TP=0|ip-26-0-161-142]: No checkpoint path provided.
[default3]:07/06/2024 09:34:33 [INFO|DP=5|PP=1|TP=1|ip-26-0-165-59]: No checkpoint path provided.
[default2]:07/06/2024 09:34:33 [INFO|DP=5|PP=1|TP=0|ip-26-0-165-59]: No checkpoint path provided.
[default7]:07/06/2024 09:34:33 [INFO|DP=7|PP=0|TP=1|ip-26-0-161-142]: No checkpoint path provided.
[default4]:07/06/2024 09:34:33 [INFO|DP=6|PP=0|TP=0|ip-26-0-161-142]: No checkpoint path provided.
[default5]:07/06/2024 09:34:33 [INFO|DP=6|PP=0|TP=1|ip-26-0-161-142]: No checkpoint path provided.
[default6]:07/06/2024 09:34:33 [INFO|DP=7|PP=0|TP=0|ip-26-0-161-142]: No checkpoint path provided.
[default7]:07/06/2024 09:34:33 [INFO|DP=7|PP=1|TP=1|ip-26-0-165-59]: No checkpoint path provided.
[default5]:07/06/2024 09:34:33 [INFO|DP=6|PP=1|TP=1|ip-26-0-165-59]: No checkpoint path provided.
[default4]:07/06/2024 09:34:33 [INFO|DP=6|PP=1|TP=0|ip-26-0-165-59]: No checkpoint path provided.
[default6]:07/06/2024 09:34:33 [INFO|DP=7|PP=1|TP=0|ip-26-0-165-59]: No checkpoint path provided.
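Note: the per-rank parameter counts above are mutually consistent: each pipeline stage is split over tp=2, the first stage holding 345M parameters per shard and the second 261M, which reproduces the reported 1.21G total, and at bf16 (2 bytes/param) 345M is roughly the 658MiB logged. A quick check (plain Python, using the rounded figures from the log):

    # Cross-check of the parameter counts reported above (rounded log figures).
    tp = 2
    pp0_shard, pp1_shard = 345e6, 261e6             # params per TP shard, PP stage 0 / 1

    total = tp * (pp0_shard + pp1_shard)            # 1.212e9, close to the 1.21G reported
    pp0_mib = pp0_shard * 2 / 2**20                 # bf16: ~658 MiB, matching 658.27MiB
    pp1_mib = pp1_shard * 2 / 2**20                 # ~498 MiB, matching 498.24MiB

    print(f"{total:.3e}", round(pp0_mib), round(pp1_mib))  # 1.212e+09 658 498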
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [Optimizer Building] Using LearningRateForSP as learning rate
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] Size of optimizer params per rank:
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] DP Rank 0 has 21.6M out of 345M (6.25%) params' optimizer states
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] DP Rank 1 has 21.6M out of 345M (6.25%) params' optimizer states
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] DP Rank 2 has 21.6M out of 345M (6.25%) params' optimizer states
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] DP Rank 3 has 21.6M out of 345M (6.25%) params' optimizer states
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] DP Rank 4 has 21.6M out of 345M (6.25%) params' optimizer states
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] DP Rank 5 has 21.6M out of 345M (6.25%) params' optimizer states
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] DP Rank 6 has 21.6M out of 345M (6.25%) params' optimizer states
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] DP Rank 7 has 21.6M out of 345M (6.25%) params' optimizer states
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] DP Rank 8 has 21.6M out of 345M (6.25%) params' optimizer states
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] DP Rank 9 has 21.6M out of 345M (6.25%) params' optimizer states
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] DP Rank 10 has 21.6M out of 345M (6.25%) params' optimizer states
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] DP Rank 11 has 21.6M out of 345M (6.25%) params' optimizer states
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] DP Rank 12 has 21.6M out of 345M (6.25%) params' optimizer states
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] DP Rank 13 has 21.6M out of 345M (6.25%) params' optimizer states
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] DP Rank 14 has 21.6M out of 345M (6.25%) params' optimizer states
[default0]:07/06/2024 09:34:37 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [ZeRO sharding] DP Rank 15 has 21.6M out of 345M (6.25%) params' optimizer states
[default0]:07/06/2024 09:34:40 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples
[default0]:07/06/2024 09:34:40 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Using `datasets` library
[default0]:07/06/2024 09:34:40 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4')
[default0]:07/06/2024 09:34:40 [WARNING|DP=0|PP=0|TP=0|ip-26-0-160-192]: Repo card metadata block was not found. Setting CardData to empty.
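Note: with zero_stage=1 the optimizer states for the 345M local parameters are sharded across the dp=16 replicas, so each DP rank keeps states for 345M / 16, which is the ~21.6M (6.25%) shown for every rank above. Sketch (plain Python):

    # ZeRO-1 sharding arithmetic for the block above.
    local_params, dp = 345e6, 16
    per_rank = local_params / dp   # ~21.6M params' optimizer states per DP rank
    fraction = 1 / dp              # 0.0625 -> the "6.25%" printed for ranks 0..15
    print(round(per_rank / 1e6, 1), f"{fraction:.2%}")  # 21.6 6.25%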
[default0]:07/06/2024 09:34:41 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [Training Plan] There are 1 training stages
[default0]:07/06/2024 09:34:41 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [Stage Training Stage] start from step 1
[default0]:07/06/2024 09:34:41 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]:
[default0]:07/06/2024 09:34:41 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: [Start training] datetime: 2024-07-06 09:34:41.537221 | mbs: 2 | grad_accum: 32 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0
[default1]:07/06/2024 09:34:41 [WARNING|DP=4|PP=1|TP=1|ip-26-0-165-59]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/06/2024 09:34:41 [WARNING|DP=11|PP=1|TP=1|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/06/2024 09:34:41 [WARNING|DP=14|PP=0|TP=1|ip-26-0-163-134]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:34:41 [WARNING|DP=0|PP=1|TP=0|ip-26-0-165-131]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/06/2024 09:34:41 [WARNING|DP=3|PP=1|TP=1|ip-26-0-165-131]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/06/2024 09:34:41 [WARNING|DP=3|PP=1|TP=0|ip-26-0-165-131]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/06/2024 09:34:41 [WARNING|DP=1|PP=1|TP=1|ip-26-0-165-131]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/06/2024 09:34:41 [WARNING|DP=2|PP=1|TP=0|ip-26-0-165-131]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/06/2024 09:34:41 [WARNING|DP=0|PP=1|TP=1|ip-26-0-165-131]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/06/2024 09:34:41 [WARNING|DP=2|PP=1|TP=1|ip-26-0-165-131]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/06/2024 09:34:41 [WARNING|DP=9|PP=0|TP=1|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/06/2024 09:34:41 [WARNING|DP=9|PP=0|TP=0|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/06/2024 09:34:41 [WARNING|DP=1|PP=1|TP=0|ip-26-0-165-131]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/06/2024 09:34:41 [WARNING|DP=10|PP=0|TP=0|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/06/2024 09:34:41 [WARNING|DP=8|PP=0|TP=1|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/06/2024 09:34:41 [WARNING|DP=10|PP=0|TP=1|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:34:41 [WARNING|DP=12|PP=0|TP=0|ip-26-0-163-134]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/06/2024 09:34:41 [WARNING|DP=11|PP=0|TP=1|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/06/2024 09:34:41 [WARNING|DP=13|PP=0|TP=0|ip-26-0-163-134]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:34:41 [WARNING|DP=8|PP=1|TP=0|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:34:41 [WARNING|DP=12|PP=1|TP=0|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/06/2024 09:34:41 [WARNING|DP=8|PP=1|TP=1|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/06/2024 09:34:41 [WARNING|DP=7|PP=0|TP=1|ip-26-0-161-142]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/06/2024 09:34:41 [WARNING|DP=3|PP=0|TP=1|ip-26-0-160-192]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/06/2024 09:34:41 [WARNING|DP=13|PP=0|TP=1|ip-26-0-163-134]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/06/2024 09:34:41 [WARNING|DP=11|PP=1|TP=0|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/06/2024 09:34:41 [WARNING|DP=14|PP=0|TP=0|ip-26-0-163-134]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:34:41 [WARNING|DP=4|PP=1|TP=0|ip-26-0-165-59]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/06/2024 09:34:41 [WARNING|DP=7|PP=0|TP=0|ip-26-0-161-142]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:34:41 [WARNING|DP=8|PP=0|TP=0|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/06/2024 09:34:41 [WARNING|DP=10|PP=1|TP=0|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:34:41 [WARNING|DP=4|PP=0|TP=0|ip-26-0-161-142]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/06/2024 09:34:41 [WARNING|DP=9|PP=1|TP=1|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/06/2024 09:34:41 [WARNING|DP=4|PP=0|TP=1|ip-26-0-161-142]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/06/2024 09:34:41 [WARNING|DP=1|PP=0|TP=1|ip-26-0-160-192]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/06/2024 09:34:41 [WARNING|DP=5|PP=0|TP=1|ip-26-0-161-142]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/06/2024 09:34:41 [WARNING|DP=10|PP=1|TP=1|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/06/2024 09:34:41 [WARNING|DP=6|PP=0|TP=0|ip-26-0-161-142]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/06/2024 09:34:41 [WARNING|DP=15|PP=1|TP=1|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/06/2024 09:34:41 [WARNING|DP=11|PP=0|TP=0|ip-26-0-161-78]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/06/2024 09:34:41 [WARNING|DP=12|PP=1|TP=1|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/06/2024 09:34:41 [WARNING|DP=7|PP=1|TP=1|ip-26-0-165-59]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/06/2024 09:34:41 [WARNING|DP=5|PP=0|TP=0|ip-26-0-161-142]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/06/2024 09:34:41 [WARNING|DP=6|PP=0|TP=1|ip-26-0-161-142]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/06/2024 09:34:41 [WARNING|DP=13|PP=1|TP=0|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/06/2024 09:34:41 [WARNING|DP=5|PP=1|TP=1|ip-26-0-165-59]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/06/2024 09:34:41 [WARNING|DP=6|PP=1|TP=1|ip-26-0-165-59]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/06/2024 09:34:41 [WARNING|DP=15|PP=0|TP=0|ip-26-0-163-134]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/06/2024 09:34:41 [WARNING|DP=1|PP=0|TP=0|ip-26-0-160-192]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/06/2024 09:34:41 [WARNING|DP=14|PP=1|TP=1|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/06/2024 09:34:41 [WARNING|DP=5|PP=1|TP=0|ip-26-0-165-59]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/06/2024 09:34:41 [WARNING|DP=7|PP=1|TP=0|ip-26-0-165-59]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/06/2024 09:34:41 [WARNING|DP=0|PP=0|TP=1|ip-26-0-160-192]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/06/2024 09:34:41 [WARNING|DP=15|PP=1|TP=0|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty.
[default5]:07/06/2024 09:34:41 [WARNING|DP=2|PP=0|TP=1|ip-26-0-160-192]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/06/2024 09:34:41 [WARNING|DP=6|PP=1|TP=0|ip-26-0-165-59]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/06/2024 09:34:41 [WARNING|DP=14|PP=1|TP=0|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty.
[default7]:07/06/2024 09:34:41 [WARNING|DP=15|PP=0|TP=1|ip-26-0-163-134]: Repo card metadata block was not found. Setting CardData to empty.
[default6]:07/06/2024 09:34:41 [WARNING|DP=3|PP=0|TP=0|ip-26-0-160-192]: Repo card metadata block was not found. Setting CardData to empty.
[default4]:07/06/2024 09:34:41 [WARNING|DP=2|PP=0|TP=0|ip-26-0-160-192]: Repo card metadata block was not found. Setting CardData to empty.
[default1]:07/06/2024 09:34:41 [WARNING|DP=12|PP=0|TP=1|ip-26-0-163-134]: Repo card metadata block was not found. Setting CardData to empty.
[default3]:07/06/2024 09:34:42 [WARNING|DP=13|PP=1|TP=1|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty.
[default2]:07/06/2024 09:34:42 [WARNING|DP=9|PP=1|TP=0|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty.
[default0]:07/06/2024 09:34:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps
[default0]:07/06/2024 09:34:48 [INFO|DP=0|PP=0|TP=0|ip-26-0-160-192]: Memory usage: 2071.11MiB. Peak allocated 2071.11MiB. Peak reserved: 2092.00MiB
[default2]:Traceback (most recent call last):
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default2]: trainer.train(dataloader)
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train
[default2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step
[default2]: outputs = self.pipeline_engine.train_batch_iter(
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter
[default2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default2]: output = model(**micro_batch)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default2]: return self._call_impl(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default2]: return forward_call(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward
[default2]: sharded_logits = self.model(
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default2]: return self._call_impl(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default2]: return forward_call(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default2]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default2]: return self._call_impl(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default2]: return forward_call(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default2]: output = self.pp_block(**new_kwargs)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default2]: return self._call_impl(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default2]: return forward_call(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 630, in forward
[default2]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default2]: return self._call_impl(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default2]: return forward_call(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 565, in forward
[default2]: query_states, key_value_states = self.flash_rotary_embedding(query_states, kv=key_value_states)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default2]: return self._call_impl(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default2]: return forward_call(*args, **kwargs)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/flash_attn/layers/rotary.py", line 457, in forward
[default2]: q = apply_rotary_emb_func(
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/flash_attn/layers/rotary.py", line 122, in apply_rotary_emb
[default2]: return ApplyRotaryEmb.apply(
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
[default2]: return super().apply(*args, **kwargs) # type: ignore[misc]
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/flash_attn/layers/rotary.py", line 48, in forward
[default2]: out = apply_rotary(
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/flash_attn/ops/triton/rotary.py", line 202, in apply_rotary
[default2]: rotary_kernel[grid](
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/jit.py", line 532, in run
[default2]: self.cache[device][key] = compile(
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/compiler/compiler.py", line 503, in compile
[default2]: metadata_group = fn_cache_manager.get_group(metadata_filename) or {}
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/cache.py", line 90, in get_group
[default2]: grp_data = json.load(f)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/json/__init__.py", line 293, in load
[default2]: return loads(fp.read(),
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/json/__init__.py", line 346, in loads
[default2]: return _default_decoder.decode(s)
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/json/decoder.py", line 337, in decode
[default2]: obj, end = self.raw_decode(s, idx=_w(s, 0).end())
[default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/json/decoder.py", line 355, in raw_decode
[default2]: raise JSONDecodeError("Expecting value", s, err.value) from None
[default2]:json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
[default5]:Traceback (most recent call last):
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default5]: trainer.train(dataloader)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train
[default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step
[default5]: outputs = self.pipeline_engine.train_batch_iter(
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter
[default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default5]: output = model(**micro_batch)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: return self._call_impl(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]: return forward_call(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward
[default5]: sharded_logits = self.model(
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: return self._call_impl(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]: return forward_call(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default5]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: return self._call_impl(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]: return forward_call(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default5]: output = self.pp_block(**new_kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: return self._call_impl(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]: return forward_call(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 636, in forward
[default5]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"]
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: return self._call_impl(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]: return forward_call(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward
[default5]: hidden_states = self.down_proj(self.split_silu_mul(merged_states))
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: return self._call_impl(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]: return forward_call(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward
[default5]: return row_linear(
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear
[default5]: out = F.linear(input, weight, bias)
[default5]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 5 has a total capacity of 79.33 GiB of which 7.94 MiB is free. Including non-PyTorch memory, this process has 79.31 GiB memory in use. Of the allocated memory 70.12 GiB is allocated by PyTorch, and 56.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2024-07-06 09:34:53,965] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 627087 closing signal SIGTERM
[2024-07-06 09:34:53,965] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 627088 closing signal SIGTERM
[2024-07-06 09:34:53,967] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 627090 closing signal SIGTERM
[2024-07-06 09:34:53,967] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 627091 closing signal SIGTERM
[2024-07-06 09:34:53,968] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 627092 closing signal SIGTERM
[2024-07-06 09:34:53,972] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 627093 closing signal SIGTERM
[2024-07-06 09:34:53,973] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 627094 closing signal SIGTERM
[default5]:Traceback (most recent call last):
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in
[default5]: trainer.train(dataloader)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train
[default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step
[default5]: outputs = self.pipeline_engine.train_batch_iter(
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter
[default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default5]: output = model(**micro_batch)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: return self._call_impl(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]: return forward_call(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward
[default5]: sharded_logits = self.model(
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: return self._call_impl(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]: return forward_call(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default5]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: return self._call_impl(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]: return forward_call(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default5]: output = self.pp_block(**new_kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: return self._call_impl(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]: return forward_call(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 636, in forward
[default5]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"]
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: return self._call_impl(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]: return forward_call(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward
[default5]: hidden_states = self.down_proj(self.split_silu_mul(merged_states))
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default5]: return self._call_impl(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default5]: return forward_call(*args, **kwargs)
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward
[default5]: return row_linear(
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear
[default5]: out = F.linear(input, weight, bias)
[default5]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 5 has a total capacity of 79.33 GiB of which 7.94 MiB is free. Including non-PyTorch memory, this process has 79.31 GiB memory in use. Of the allocated memory 70.12 GiB is allocated by PyTorch, and 56.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
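Note on the OutOfMemoryError above: the allocator hint in the message (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) mainly helps when a large amount of memory is reserved by PyTorch but unallocated; in these traces that figure is only 40-56 MiB against roughly 70 GiB of live allocations, so the GPU is effectively full rather than fragmented. If one still wants to try the hint, the variable has to be in the environment before the first CUDA allocation. A minimal sketch, assuming it is set at the very top of the training entrypoint (exporting it in the sbatch/torchrun launch script is equivalent; neither appears in this log):

    import os
    # Must be set before torch performs its first CUDA allocation, otherwise it is ignored.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    import torch  # imported only after the allocator option is in place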
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default2]:Traceback (most recent call last): [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]: trainer.train(dataloader) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default2]: outputs = self.pipeline_engine.train_batch_iter( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]: output = model(**micro_batch) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default2]: sharded_logits = self.model( [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]: pipeline_state.run_communication() [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in 
run_communication [default2]: recv_activation_tensor = recv_activation() [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]: dist.recv( [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default2]: return func(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default2]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f50d7902d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0x589518e (0x7f510f8bc18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f510f8b69a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f510f8b6ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f510f8b7b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f510f86cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f510f86cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f510f86cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f510f86cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, 
int) + 0xa9 (0x7f50d8aaac69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default2]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f50d8ab1c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default2]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f50d8ad4b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default2]:frame #12: + 0x5838439 (0x7f510f85f439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default2]:frame #13: + 0x5843330 (0x7f510f86a330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default2]:frame #14: + 0x58433c5 (0x7f510f86a3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default2]:frame #15: + 0x4e893cc (0x7f510eeb03cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default2]:frame #16: + 0x1a08a88 (0x7f510ba2fa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default2]:frame #17: + 0x5849a84 (0x7f510f870a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default2]:frame #18: + 0x584ed35 (0x7f510f875d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[default2]:frame #19: + 0xc97eee (0x7f5122127eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default2]:frame #20: + 0x413ea4 (0x7f51218a3ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[default2]:frame #21: + 0x1445a6 (0x559c0854b5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #22: _PyObject_MakeTpCall + 0x26b (0x559c08544a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #23: + 0x150866 (0x559c08557866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x559c08540142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #25: _PyFunction_Vectorcall + 0x6c (0x559c0854ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #26: PyObject_Call + 0xbc (0x559c08557f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x559c0853e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #28: _PyFunction_Vectorcall + 0x6c (0x559c0854ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x559c0853c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #30: + 0x150582 (0x559c08557582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x559c0853c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #32: + 0x150582 (0x559c08557582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x559c0853c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #34: + 0x150582 (0x559c08557582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x559c0853c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x559c08543f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #37: _PyObject_Call_Prepend + 0x69 (0x559c08555c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #38: + 0x211239 (0x559c08618239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #39: _PyObject_MakeTpCall + 0x26b (0x559c08544a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x559c085403e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #41: _PyFunction_Vectorcall + 0x6c (0x559c0854ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x559c0853bc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #43: _PyFunction_Vectorcall + 0x6c (0x559c0854ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x559c0853c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #45: + 0x150582 (0x559c08557582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #46: PyObject_Call + 0xbc (0x559c08557f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x559c0853e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #48: + 0x150582 (0x559c08557582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #49: PyObject_Call + 0xbc (0x559c08557f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x559c0853e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #51: _PyFunction_Vectorcall + 0x6c (0x559c0854ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x559c08544007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #53: _PyObject_Call_Prepend + 0x69 (0x559c08555c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #54: + 0x211239 (0x559c08618239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #55: PyObject_Call + 0x207 (0x559c08558067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x559c0853e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #57: + 0x150582 (0x559c08557582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x559c0853c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #59: + 0x150582 (0x559c08557582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #60: PyObject_Call + 0xbc (0x559c08557f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x559c0853e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #62: + 0x150582 (0x559c08557582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:frame #63: PyObject_Call + 0xbc (0x559c08557f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default2]:. This may indicate a possible application crash on rank 0 or a network set up issue.
[default3]:Traceback (most recent call last):
[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in
[default3]: trainer.train(dataloader)
[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train
[default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step
[default3]: outputs = self.pipeline_engine.train_batch_iter(
[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter
[default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default3]: output = model(**micro_batch)
[default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default3]: return self._call_impl(*args, **kwargs)
[default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default3]: return forward_call(*args, **kwargs)
[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward
[default3]: sharded_logits = self.model(
[default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default3]: return self._call_impl(*args, **kwargs)
[default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default3]: return forward_call(*args, **kwargs)
[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default3]: return self._call_impl(*args, **kwargs)
[default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default3]: return forward_call(*args, **kwargs)
[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward
[default3]: output = self.pp_block(**new_kwargs)
[default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default3]: return self._call_impl(*args, **kwargs)
[default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default3]: return forward_call(*args, **kwargs)
[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 636, in forward
[default3]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"]
[default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default3]: return self._call_impl(*args, **kwargs)
[default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default3]: return forward_call(*args, **kwargs)
[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 170, in forward
[default3]: merged_states = self.gate_up_proj(hidden_states)
[default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default3]: return self._call_impl(*args, **kwargs)
[default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default3]: return forward_call(*args, **kwargs)
[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward
[default3]: return column_linear(
[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear
[default3]: return F.linear(input, weight, bias)
[default3]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 3 has a total capacity of 79.33 GiB of which 55.94 MiB is free. Including non-PyTorch memory, this process has 79.26 GiB memory in use. Of the allocated memory 70.33 GiB is allocated by PyTorch, and 55.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default3]:Traceback (most recent call last): [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]: trainer.train(dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default3]: outputs = self.pipeline_engine.train_batch_iter( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]: output = model(**micro_batch) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]: sharded_logits = self.model( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]: pipeline_state.run_communication() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]: recv_activation_tensor = recv_activation() [default3]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]: dist.recv( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default3]: return func(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default3]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1fa10e3d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0x589518e (0x7f1fd909d18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f1fd90979a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f1fd9097ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f1fd9098b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1fd904df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1fd904df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1fd904df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f1fd904df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f1fa228bc69 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f1fa2292c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f1fa22b5b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #12: + 0x5838439 (0x7f1fd9040439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #13: + 0x5843330 (0x7f1fd904b330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #14: + 0x58433c5 (0x7f1fd904b3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #15: + 0x4e893cc (0x7f1fd86913cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #16: + 0x1a08a88 (0x7f1fd5210a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #17: + 0x5849a84 (0x7f1fd9051a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #18: + 0x584ed35 (0x7f1fd9056d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #19: + 0xc97eee (0x7f1feb908eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #20: + 0x413ea4 (0x7f1feb084ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #21: + 0x1445a6 (0x5594967145a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55949670da6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #23: + 0x150866 (0x559496720866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x559496709142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #25: _PyFunction_Vectorcall + 0x6c (0x559496714a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #26: PyObject_Call + 0xbc (0x559496720f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5594967072b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #28: _PyFunction_Vectorcall + 0x6c (0x559496714a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5594967058fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #30: + 0x150582 (0x559496720582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5594967058fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #32: + 0x150582 
(0x559496720582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5594967058fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #34: + 0x150582 (0x559496720582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5594967058fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55949670cf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55949671ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #38: + 0x211239 (0x5594967e1239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55949670da6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5594967093e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #41: _PyFunction_Vectorcall + 0x6c (0x559496714a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x559496704c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #43: _PyFunction_Vectorcall + 0x6c (0x559496714a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5594967058fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #45: + 0x150582 (0x559496720582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #46: PyObject_Call + 0xbc (0x559496720f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5594967072b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #48: + 0x150582 (0x559496720582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #49: PyObject_Call + 0xbc (0x559496720f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5594967072b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #51: _PyFunction_Vectorcall + 0x6c (0x559496714a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55949670d007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55949671ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #54: + 0x211239 (0x5594967e1239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #55: PyObject_Call + 0x207 (0x559496721067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5594967072b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #57: + 0x150582 (0x559496720582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5594967058fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #59: + 0x150582 (0x559496720582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #60: PyObject_Call + 0xbc (0x559496720f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5594967072b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #62: + 0x150582 (0x559496720582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #63: PyObject_Call + 0xbc (0x559496720f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:Traceback (most recent call last): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]: trainer.train(dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]: outputs = self.pipeline_engine.train_batch_iter( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]: output = model(**micro_batch) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default6]: sharded_logits = self.model( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default6]: output = self.pp_block(**new_kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 630, in forward [default6]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 562, in forward [default6]: key_value_states = torch.cat([key_states.unsqueeze(0), value_states.unsqueeze(0)], dim=0) [default6]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 6 has a total capacity of 79.33 GiB of which 15.94 MiB is free. Including non-PyTorch memory, this process has 79.30 GiB memory in use. Of the allocated memory 70.25 GiB is allocated by PyTorch, and 40.47 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default7]:Traceback (most recent call last): [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]: trainer.train(dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default7]: outputs = self.pipeline_engine.train_batch_iter( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]: output = model(**micro_batch) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default7]: sharded_logits = self.model( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default7]: output = self.pp_block(**new_kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: 
return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 636, in forward [default7]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 170, in forward [default7]: merged_states = self.gate_up_proj(hidden_states) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default7]: return column_linear( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default7]: return F.linear(input, weight, bias) [default7]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 7 has a total capacity of 79.33 GiB of which 15.94 MiB is free. Including non-PyTorch memory, this process has 79.30 GiB memory in use. Of the allocated memory 70.33 GiB is allocated by PyTorch, and 55.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
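Note on the recurring OOMs: every failing rank reports roughly 70 GiB already allocated by PyTorch out of a 79.33 GiB device, and the allocation that finally fails is only 32-64 MiB, so this parallel configuration leaves essentially no per-GPU headroom for the forward pass. A quick way to see how the budget is spent right before the failing step is PyTorch's built-in allocator statistics; the snippet below is a generic sketch using standard torch.cuda calls, and where to call it (for example inside the training step) is an assumption:

    import torch

    def log_cuda_memory(tag: str) -> None:
        # Print coarse allocator statistics for the current device.
        dev = torch.cuda.current_device()
        allocated = torch.cuda.memory_allocated(dev) / 2**30
        reserved = torch.cuda.memory_reserved(dev) / 2**30
        peak = torch.cuda.max_memory_allocated(dev) / 2**30
        print(f"[{tag}] allocated={allocated:.2f} GiB reserved={reserved:.2f} GiB peak={peak:.2f} GiB")
        # Detailed breakdown by allocation size; helps distinguish fragmentation from real exhaustion.
        print(torch.cuda.memory_summary(device=dev, abbreviated=True))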
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default1]:Traceback (most recent call last): [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: output = model(**micro_batch) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: sharded_logits = self.model( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]: output = self.pp_block(**new_kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 630, in forward [default1]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 388, in forward [default1]: .contiguous() [default1]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 48.00 MiB. GPU 1 has a total capacity of 79.33 GiB of which 39.94 MiB is free. Including non-PyTorch memory, this process has 79.27 GiB memory in use. Of the allocated memory 70.53 GiB is allocated by PyTorch, and 39.94 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default5]:Traceback (most recent call last): [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]: trainer.train(dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default5]: outputs = self.pipeline_engine.train_batch_iter( [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]: output = model(**micro_batch) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default5]: sharded_logits = self.model( [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default5]: output = self.pp_block(**new_kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 636, in forward [default5]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 171, in forward [default5]: hidden_states = self.down_proj(self.split_silu_mul(merged_states)) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 159, in forward [default5]: return row_linear( [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 474, in row_linear [default5]: out = F.linear(input, weight, bias) [default5]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 5 has a total capacity of 79.33 GiB of which 7.94 MiB is free. Including non-PyTorch memory, this process has 79.31 GiB memory in use. Of the allocated memory 70.12 GiB is allocated by PyTorch, and 56.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default1]:Traceback (most recent call last): [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: output = model(**micro_batch) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: sharded_logits = self.model( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]: output = self.pp_block(**new_kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: 
return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 630, in forward [default1]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 388, in forward [default1]: .contiguous() [default1]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 48.00 MiB. GPU 1 has a total capacity of 79.33 GiB of which 39.94 MiB is free. Including non-PyTorch memory, this process has 79.27 GiB memory in use. Of the allocated memory 70.53 GiB is allocated by PyTorch, and 39.94 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default3]:Traceback (most recent call last): [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]: trainer.train(dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default3]: outputs = self.pipeline_engine.train_batch_iter( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]: output = model(**micro_batch) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]: sharded_logits = self.model( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default3]: output = self.pp_block(**new_kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 636, in forward [default3]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 170, in forward [default3]: merged_states = self.gate_up_proj(hidden_states) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default3]: return column_linear( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default3]: return F.linear(input, weight, bias) [default3]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 3 has a total capacity of 79.33 GiB of which 53.94 MiB is free. Including non-PyTorch memory, this process has 79.26 GiB memory in use. Of the allocated memory 70.33 GiB is allocated by PyTorch, and 55.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default6]:Traceback (most recent call last): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]: trainer.train(dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]: outputs = self.pipeline_engine.train_batch_iter( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]: output = model(**micro_batch) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default6]: sharded_logits = self.model( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default6]: output = self.pp_block(**new_kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 630, in forward [default6]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 562, in forward [default6]: key_value_states = torch.cat([key_states.unsqueeze(0), value_states.unsqueeze(0)], dim=0) [default6]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 6 has a total capacity of 79.33 GiB of which 15.94 MiB is free. Including non-PyTorch memory, this process has 79.30 GiB memory in use. Of the allocated memory 70.25 GiB is allocated by PyTorch, and 40.47 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default7]:Traceback (most recent call last): [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]: trainer.train(dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default7]: outputs = self.pipeline_engine.train_batch_iter( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]: output = model(**micro_batch) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default7]: sharded_logits = self.model( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default7]: output = self.pp_block(**new_kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 636, in forward [default7]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 170, in forward [default7]: merged_states = self.gate_up_proj(hidden_states) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default7]: return column_linear( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default7]: return F.linear(input, weight, bias) [default7]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 7 has a total capacity of 79.33 GiB of which 15.94 MiB is free. Including non-PyTorch memory, this process has 79.30 GiB memory in use. Of the allocated memory 70.33 GiB is allocated by PyTorch, and 55.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default1]:Traceback (most recent call last): [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: output = model(**micro_batch) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: sharded_logits = self.model( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default1]: output = self.pp_block(**new_kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 630, in forward [default1]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 388, in forward [default1]: .contiguous() [default1]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 48.00 MiB. GPU 1 has a total capacity of 79.33 GiB of which 39.94 MiB is free. Including non-PyTorch memory, this process has 79.27 GiB memory in use. Of the allocated memory 70.53 GiB is allocated by PyTorch, and 39.94 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default7]:Traceback (most recent call last): [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]: trainer.train(dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default7]: outputs = self.pipeline_engine.train_batch_iter( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]: output = model(**micro_batch) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default7]: sharded_logits = self.model( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default7]: output = self.pp_block(**new_kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: 
return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 636, in forward [default7]: hidden_states = self.mlp(hidden_states=hidden_states)["hidden_states"] [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 170, in forward [default7]: merged_states = self.gate_up_proj(hidden_states) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default7]: return column_linear( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default7]: return F.linear(input, weight, bias) [default7]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 7 has a total capacity of 79.33 GiB of which 15.94 MiB is free. Including non-PyTorch memory, this process has 79.30 GiB memory in use. Of the allocated memory 70.33 GiB is allocated by PyTorch, and 55.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[default3]:Traceback (most recent call last): [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]: trainer.train(dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default3]: outputs = self.pipeline_engine.train_batch_iter( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]: output = model(**micro_batch) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]: sharded_logits = self.model( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default3]: output = self.pp_block(**new_kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 636, in forward [default3]: hidden_states =
self.mlp(hidden_states=hidden_states)["hidden_states"] [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 170, in forward [default3]: merged_states = self.gate_up_proj(hidden_states) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 87, in forward [default3]: return column_linear( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/tensor_parallel/functional.py", line 359, in column_linear [default3]: return F.linear(input, weight, bias) [default3]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 3 has a total capacity of 79.33 GiB of which 55.94 MiB is free. Including non-PyTorch memory, this process has 79.26 GiB memory in use. Of the allocated memory 70.33 GiB is allocated by PyTorch, and 55.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default6]:Traceback (most recent call last): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]: trainer.train(dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]: outputs = self.pipeline_engine.train_batch_iter( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]: output = model(**micro_batch) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default6]: sharded_logits = self.model( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default6]: output = self.pp_block(**new_kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: 
return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 630, in forward [default6]: output = self.attn(hidden_states=hidden_states, sequence_mask=sequence_mask) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 562, in forward [default6]: key_value_states = torch.cat([key_states.unsqueeze(0), value_states.unsqueeze(0)], dim=0) [default6]:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 6 has a total capacity of 79.33 GiB of which 15.94 MiB is free. Including non-PyTorch memory, this process has 79.30 GiB memory in use. Of the allocated memory 70.25 GiB is allocated by PyTorch, and 40.47 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [default5]:Traceback (most recent call last): [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]: trainer.train(dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default5]: outputs = self.pipeline_engine.train_batch_iter( [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]: output = model(**micro_batch) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default5]: sharded_logits = self.model( [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default5]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]: pipeline_state.run_communication() [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]: recv_activation_tensor = recv_activation() [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default5]: dist.recv( [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default5]: return func(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
[default4]:Traceback (most recent call last): [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]: trainer.train(dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]: outputs = self.pipeline_engine.train_batch_iter( [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]: output = model(**micro_batch) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default4]: sharded_logits = self.model( [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]: pipeline_state.run_communication() [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]: recv_activation_tensor = recv_activation() [default4]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default4]: dist.recv( [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default4]: return func(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default4]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default1]:Traceback (most recent call last): [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: output = model(**micro_batch) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: sharded_logits = self.model( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]: return self.forward_with_hidden_states(input_ids=input_ids, 
input_mask=input_mask)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]: pipeline_state.run_communication() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]: recv_activation_tensor = recv_activation() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default1]: dist.recv( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default1]: return func(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default1]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
[default0]:Traceback (most recent call last): [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]: trainer.train(dataloader) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default0]: outputs = self.pipeline_engine.train_batch_iter( [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]: output = model(**micro_batch) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default0]: sharded_logits = self.model( [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]: pipeline_state.run_communication() [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]: recv_activation_tensor = recv_activation() [default0]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default0]: dist.recv( [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]: return func(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default0]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default6]:Traceback (most recent call last): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]: trainer.train(dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]: outputs = self.pipeline_engine.train_batch_iter( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]: output = model(**micro_batch) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default6]: sharded_logits = self.model( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]: return self.forward_with_hidden_states(input_ids=input_ids, 
input_mask=input_mask)[0] [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]: pipeline_state.run_communication() [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]: recv_activation_tensor = recv_activation() [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default6]: dist.recv( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default6]: return func(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default6]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
[default7]:Traceback (most recent call last): [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]: trainer.train(dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default7]: outputs = self.pipeline_engine.train_batch_iter( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]: output = model(**micro_batch) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default7]: sharded_logits = self.model( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]: pipeline_state.run_communication() [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]: recv_activation_tensor = recv_activation() [default7]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default7]: dist.recv( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default7]: return func(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default7]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default1]:Traceback (most recent call last): [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: output = model(**micro_batch) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: sharded_logits = self.model( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]: return self.forward_with_hidden_states(input_ids=input_ids, 
input_mask=input_mask)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]: pipeline_state.run_communication() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]: recv_activation_tensor = recv_activation() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default1]: dist.recv( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default1]: return func(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default1]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
[default6]:Traceback (most recent call last): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]: trainer.train(dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]: outputs = self.pipeline_engine.train_batch_iter( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]: output = model(**micro_batch) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default6]: sharded_logits = self.model( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]: pipeline_state.run_communication() [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]: recv_activation_tensor = recv_activation() [default6]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default6]: dist.recv( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default6]: return func(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default6]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default7]:Traceback (most recent call last): [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]: trainer.train(dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default7]: outputs = self.pipeline_engine.train_batch_iter( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]: output = model(**micro_batch) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default7]: sharded_logits = self.model( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]: return self.forward_with_hidden_states(input_ids=input_ids, 
input_mask=input_mask)[0] [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]: pipeline_state.run_communication() [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]: recv_activation_tensor = recv_activation() [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default7]: dist.recv( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default7]: return func(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default7]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
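The block.py -> recv_from_pipeline_state_buffer -> run_communication -> irecv_tensors chain repeated in these traces suggests that receives are queued and only waited on when a block actually needs the incoming activation. Below is a hypothetical sketch of that deferred-receive idea with invented names; PendingActivationBuffer is not nanotron's PipelineState, only an illustration of the pattern.

import torch
import torch.distributed as dist

class PendingActivationBuffer:
    # Invented stand-in for the pipeline state buffer seen in the traceback.
    def __init__(self) -> None:
        self._pending = []  # list of (buffer, async work handle)

    def post_recv(self, shape, from_rank, device, tag=0):
        # Post a non-blocking receive; nothing blocks at this point.
        buf = torch.empty(*shape, device=device)
        work = dist.irecv(buf, src=from_rank, tag=tag)
        self._pending.append((buf, work))

    def pop(self):
        # This wait() is the moment that fails above: if the sending stage has
        # already crashed, NCCL aborts the communicator and the wait raises
        # torch.distributed.DistBackendError.
        buf, work = self._pending.pop(0)
        work.wait()
        return buf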
[default1]:Traceback (most recent call last): [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: output = model(**micro_batch) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: sharded_logits = self.model( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]: pipeline_state.run_communication() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]: recv_activation_tensor = recv_activation() [default1]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default1]: dist.recv( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default1]: return func(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default1]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default3]:Traceback (most recent call last): [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]: trainer.train(dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default3]: outputs = self.pipeline_engine.train_batch_iter( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]: output = [default3]:Traceback (most recent call last): [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]: trainer.train(dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train model(**micro_batch) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]: sharded_logits = self.model( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]: pipeline_state.run_communication() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]: recv_activation_tensor = recv_activation() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default3]: dist.recv( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default3]: return func(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default3]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
[default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default3]: outputs = self.pipeline_engine.train_batch_iter( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]: output = model(**micro_batch) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]:Traceback (most recent call last): [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: output = [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]: sharded_logits = self.model( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]: return self.forward_with_hiddenmodel(**micro_batch) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: sharded_logits = self.model( _states(input_ids=input_ids, input_mask=input_mask)[0] [default3]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_c[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer all_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]: pipeline_state.run_communication() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]: recv_activation_tensor = recv_activation() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_p[default3]: pipeline_state.run_communication() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]: recv_activation_tensor = recv_activation() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] arallel/state.py", line 31, in __call__ [default1]: return 
self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default1]: dist.recv( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default1]: return func(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default1]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 269, in _recv_meta [default3]: dist.recv( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default3]: return func(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default3]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. [default3]:[rank43]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [default3]:[rank43]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down. [default3]:[rank43]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default3]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. 
[default3]:Last error: [default3]:socketProgress: Connection closed by remote peer ip-26-0-161-142.ec2.internal<37264> [default3]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first): [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4502d7bd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector, std::allocator > > const&) + 0x2f3 (0x7f4503f22fa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7f4503f2327b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7f4503f26c1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f4503f27839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #5: + 0xd3e95 (0x7f454dc2be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6) [default3]:frame #6: + 0x8609 (0x7f4552d33609 in /lib/x86_64-linux-gnu/libpthread.so.0) [default3]:frame #7: clone + 0x43 (0x7f4552afe353 in /lib/x86_64-linux-gnu/libc.so.6) [default3]: [default3]:terminate called after throwing an instance of 'c10::DistBackendError' [default3]: what(): [Rank 1] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3 [default3]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. 
[default3]:Last error:
[default3]:socketProgress: Connection closed by remote peer ip-26-0-161-142.ec2.internal<37264>
[default3]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first):
[default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4502d7bd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default3]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector<std::shared_ptr<c10d::NCCLComm>, std::allocator<std::shared_ptr<c10d::NCCLComm> > > const&) + 0x2f3 (0x7f4503f22fa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7f4503f2327b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7f4503f26c1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f4503f27839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #5: <unknown function> + 0xd3e95 (0x7f454dc2be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default3]:frame #6: <unknown function> + 0x8609 (0x7f4552d33609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default3]:frame #7: clone + 0x43 (0x7f4552afe353 in /lib/x86_64-linux-gnu/libc.so.6)
[default3]:
[default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
[default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4502d7bd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default3]:frame #1: <unknown function> + 0xdf6b11 (0x7f4503c7db11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #2: <unknown function> + 0xd3e95 (0x7f454dc2be95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default3]:frame #3: <unknown function> + 0x8609 (0x7f4552d33609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default3]:frame #4: clone + 0x43 (0x7f4552afe353 in /lib/x86_64-linux-gnu/libc.so.6)
[default3]:
[default1]:[rank41]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default1]:[rank41]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[default1]:[rank41]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3
[default1]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
[default1]:Last error:
[default1]:socketProgress: Connection closed by remote peer ip-26-0-161-142.ec2.internal<56212>
[default1]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first):
[default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6116b2cd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default1]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector<std::shared_ptr<c10d::NCCLComm>, std::allocator<std::shared_ptr<c10d::NCCLComm> > > const&) + 0x2f3 (0x7f6117cd3fa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7f6117cd427b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7f6117cd7c1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f6117cd8839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #5: <unknown function> + 0xd3e95 (0x7f61619dce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default1]:frame #6: <unknown function> + 0x8609 (0x7f6166ae4609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default1]:frame #7: clone + 0x43 (0x7f61668af353 in /lib/x86_64-linux-gnu/libc.so.6)
[default1]:
[default1]:terminate called after throwing an instance of 'c10::DistBackendError'
[default1]: what(): [Rank 1] NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.19.3
[default1]:ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
[default1]:Last error:
[default1]:socketProgress: Connection closed by remote peer ip-26-0-161-142.ec2.internal<56212>
[default1]:Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1436 (most recent call first):
[default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6116b2cd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default1]:frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::vector<std::shared_ptr<c10d::NCCLComm>, std::allocator<std::shared_ptr<c10d::NCCLComm> > > const&) + 0x2f3 (0x7f6117cd3fa3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7b (0x7f6117cd427b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x17d (0x7f6117cd7c1d in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f6117cd8839 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #5: <unknown function> + 0xd3e95 (0x7f61619dce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default1]:frame #6: <unknown function> + 0x8609 (0x7f6166ae4609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default1]:frame #7: clone + 0x43 (0x7f61668af353 in /lib/x86_64-linux-gnu/libc.so.6)
[default1]:
[default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
[default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6116b2cd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default1]:frame #1: <unknown function> + 0xdf6b11 (0x7f6117a2eb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #2: <unknown function> + 0xd3e95 (0x7f61619dce95 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/../lib/libstdc++.so.6)
[default1]:frame #3: <unknown function> + 0x8609 (0x7f6166ae4609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default1]:frame #4: clone + 0x43 (0x7f61668af353 in /lib/x86_64-linux-gnu/libc.so.6)
[default1]:
[2024-07-06 09:34:57,796] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 627089) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-06_09:34:53
  host : ip-26-0-160-192.ec2.internal
  rank : 2 (local_rank: 2)
  exitcode : 1 (pid: 627089)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-07-06 09:34:58,060] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-173-246.ec2.internal_286612_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:34:58,086] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-161-142.ec2.internal_644718_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
srun: error: ip-26-0-160-192: task 0: Exited with exit code 1
[2024-07-06 09:34:58,848] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-165-59.ec2.internal_118143_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:34:58,884] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-174-36.ec2.internal_1708240_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:34:58,905] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-165-131.ec2.internal_327422_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:34:58,925] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-163-134.ec2.internal_3621123_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:34:58,937] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-161-78.ec2.internal_226733_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:34:58,963] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 226803 closing signal SIGTERM
[2024-07-06 09:34:58,964] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 226805 closing signal SIGTERM
[2024-07-06 09:34:58,964] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 226807 closing signal SIGTERM
[2024-07-06 09:34:58,968] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1708311 closing signal SIGTERM
[2024-07-06 09:34:58,968] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1708313 closing signal SIGTERM
[2024-07-06 09:34:58,969] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1708314 closing signal SIGTERM
[2024-07-06 09:34:58,968] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 286682 closing signal SIGTERM
[2024-07-06 09:34:58,968] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 286684 closing signal SIGTERM
[2024-07-06 09:34:58,968] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 118214 closing signal SIGTERM
[2024-07-06 09:34:58,969] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 118216 closing signal SIGTERM
[2024-07-06 09:34:58,969] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 118218 closing signal SIGTERM
[2024-07-06 09:34:58,969] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1708315 closing signal SIGTERM
[2024-07-06 09:34:58,970] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1708316 closing signal SIGTERM
[2024-07-06 09:34:58,969] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 118219 closing signal SIGTERM
[2024-07-06 09:34:58,970] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 286686 closing signal SIGTERM
[2024-07-06 09:34:58,972] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 286687 closing signal SIGTERM
[2024-07-06 09:34:58,973] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1708317 closing signal SIGTERM
[2024-07-06 09:34:58,973] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3621196 closing signal SIGTERM
[2024-07-06 09:34:58,974] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3621198 closing signal SIGTERM
[2024-07-06 09:34:58,975] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 286688 closing signal SIGTERM
[2024-07-06 09:34:58,974] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3621200 closing signal SIGTERM
[2024-07-06 09:34:58,975] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 286689 closing signal SIGTERM
[2024-07-06 09:34:58,976] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 644788 closing signal SIGTERM
[2024-07-06 09:34:58,977] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 644790 closing signal SIGTERM
[2024-07-06 09:34:58,979] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1708318 closing signal SIGTERM
[2024-07-06 09:34:58,978] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 644792 closing signal SIGTERM
[2024-07-06 09:34:59,091] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 327504) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
[2024-07-06 09:34:59,143] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-165-131.ec2.internal_327422_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-165-131.ec2.internal
  rank : 33 (local_rank: 1)
  exitcode : 1 (pid: 327505)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-165-131.ec2.internal
  rank : 34 (local_rank: 2)
  exitcode : 1 (pid: 327506)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-165-131.ec2.internal
  rank : 35 (local_rank: 3)
  exitcode : 1 (pid: 327507)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-165-131.ec2.internal
  rank : 36 (local_rank: 4)
  exitcode : 1 (pid: 327508)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-165-131.ec2.internal
  rank : 37 (local_rank: 5)
  exitcode : 1 (pid: 327509)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-165-131.ec2.internal
  rank : 38 (local_rank: 6)
  exitcode : 1 (pid: 327510)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-165-131.ec2.internal
  rank : 39 (local_rank: 7)
  exitcode : 1 (pid: 327511)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-165-131.ec2.internal
  rank : 32 (local_rank: 0)
  exitcode : 1 (pid: 327504)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: ip-26-0-165-131: task 5: Exited with exit code 1
[2024-07-06 09:35:01,192] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 226804) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
[2024-07-06 09:35:01,199] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 3621197) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
[2024-07-06 09:35:01,241] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-161-78.ec2.internal_226733_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-161-78.ec2.internal
  rank : 19 (local_rank: 3)
  exitcode : 1 (pid: 226806)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-161-78.ec2.internal
  rank : 21 (local_rank: 5)
  exitcode : 1 (pid: 226808)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-161-78.ec2.internal
  rank : 22 (local_rank: 6)
  exitcode : 1 (pid: 226809)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-161-78.ec2.internal
  rank : 23 (local_rank: 7)
  exitcode : 1 (pid: 226810)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-161-78.ec2.internal
  rank : 17 (local_rank: 1)
  exitcode : 1 (pid: 226804)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-07-06 09:35:01,257] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-163-134.ec2.internal_3621123_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-163-134.ec2.internal
  rank : 27 (local_rank: 3)
  exitcode : 1 (pid: 3621199)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-163-134.ec2.internal
  rank : 29 (local_rank: 5)
  exitcode : 1 (pid: 3621201)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-163-134.ec2.internal
  rank : 30 (local_rank: 6)
  exitcode : 1 (pid: 3621202)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-163-134.ec2.internal
  rank : 31 (local_rank: 7)
  exitcode : 1 (pid: 3621203)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-163-134.ec2.internal
  rank : 25 (local_rank: 1)
  exitcode : 1 (pid: 3621197)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: ip-26-0-161-78: task 1: Exited with exit code 1
[2024-07-06 09:35:01,496] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 644789) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
[2024-07-06 09:35:01,554] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-161-142.ec2.internal_644718_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-161-142.ec2.internal
  rank : 11 (local_rank: 3)
  exitcode : 1 (pid: 644791)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-161-142.ec2.internal
  rank : 13 (local_rank: 5)
  exitcode : 1 (pid: 644793)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-161-142.ec2.internal
  rank : 14 (local_rank: 6)
  exitcode : 1 (pid: 644794)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
srun: error: ip-26-0-163-134: task 3: Exited with exit code 1
[4]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-161-142.ec2.internal
  rank : 15 (local_rank: 7)
  exitcode : 1 (pid: 644795)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-161-142.ec2.internal
  rank : 9 (local_rank: 1)
  exitcode : 1 (pid: 644789)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-07-06 09:35:01,801] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 118215) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
[2024-07-06 09:35:01,849] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-165-59.ec2.internal_118143_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-165-59.ec2.internal
  rank : 43 (local_rank: 3)
  exitcode : -6 (pid: 118217)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 118217
[2]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-165-59.ec2.internal
  rank : 46 (local_rank: 6)
  exitcode : 1 (pid: 118220)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-165-59.ec2.internal
  rank : 47 (local_rank: 7)
  exitcode : 1 (pid: 118221)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-165-59.ec2.internal
  rank : 41 (local_rank: 1)
  exitcode : -6 (pid: 118215)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 118215
============================================================
srun: error: ip-26-0-161-142: task 2: Exited with exit code 1
srun: error: ip-26-0-165-59: task 4: Exited with exit code 1
[2024-07-06 09:35:02,197] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 286683) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
[2024-07-06 09:35:02,204] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1708312) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
[2024-07-06 09:35:02,236] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-173-246.ec2.internal_286612_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
[2024-07-06 09:35:02,238] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-174-36.ec2.internal_1708240_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
    return f(*args, **kwargs)
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    run(args)
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-173-246.ec2.internal
  rank : 51 (local_rank: 3)
  exitcode : 1 (pid: 286685)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-173-246.ec2.internal
  rank : 49 (local_rank: 1)
  exitcode : 1 (pid: 286683)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-06_09:34:58
  host : ip-26-0-174-36.ec2.internal
  rank : 57 (local_rank: 1)
  exitcode : 1 (pid: 1708312)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: ip-26-0-173-246: task 6: Exited with exit code 1
srun: error: ip-26-0-174-36: task 7: Exited with exit code 1
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
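
[Editor's note, not part of the original log] The first traceback in this excerpt shows ranks 41 and 43 dying inside nanotron's _recv_meta: the receiving pipeline stage blocks on a small metadata message before posting the actual tensor receives, so when the peer process on ip-26-0-161-142 exits, the blocking dist.recv is where the aborted NCCL communicator surfaces as torch.distributed.DistBackendError. The sketch below is a simplified, hypothetical illustration of that pattern, not the nanotron P2P implementation; it assumes an already-initialized NCCL process group and a fixed-size metadata tensor.

    import torch
    import torch.distributed as dist

    def recv_meta(from_rank: int, tag: int = 0) -> torch.Tensor:
        # Hypothetical sketch only. Assumes dist.init_process_group("nccl") has
        # already run and that the sender transmits a fixed-size metadata tensor
        # (e.g. ndim / shape / dtype id) ahead of the real payload.
        meta = torch.empty(8, dtype=torch.long, device="cuda")
        # dist.recv blocks until the matching send completes; if the sending
        # process exits first, the NCCL communicator is aborted and this call
        # raises torch.distributed.DistBackendError, as seen on ranks 41/43 above.
        dist.recv(meta, src=from_rank, tag=tag)
        return meta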