|
======================== |
|
START TIME: Wed Jul 3 03:05:18 UTC 2024 |
|
python3 version = Python 3.10.14 |
|
======================== |
|
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well. |
|
Token is valid (permission: write). |
|
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token |
|
Login successful |
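
The credential-helper notice above can be handled at login time. A minimal sketch using the documented `huggingface_hub` entry point (the token string is a placeholder, not a real token):

```python
# Log in and also store the token as a git credential, as the notice above suggests.
from huggingface_hub import login

login(token="hf_xxxxxxxxxxxxxxxxxxxx", add_to_git_credential=True)
```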
|
Already on 'bench_cluster' |
|
M examples/config_tiny_llama.py |
|
M examples/config_tiny_llama.yaml |
|
M examples/train_tiny_llama.sh |
|
M src/nanotron/models/llama.py |
|
M src/nanotron/trainer.py |
|
Your branch is up to date with 'origin/bench_cluster'. |
|
Job status: RUNNING |
|
W0703 03:05:26.510000 140551663458112 torch/distributed/run.py:757] |
|
W0703 03:05:26.510000 140551663458112 torch/distributed/run.py:757] ***************************************** |
|
W0703 03:05:26.510000 140551663458112 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
W0703 03:05:26.510000 140551663458112 torch/distributed/run.py:757] ***************************************** |
|
W0703 03:05:26.563000 140057300637504 torch/distributed/run.py:757] |
|
W0703 03:05:26.563000 140057300637504 torch/distributed/run.py:757] ***************************************** |
|
W0703 03:05:26.563000 140057300637504 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
W0703 03:05:26.563000 140057300637504 torch/distributed/run.py:757] ***************************************** |
|
W0703 03:05:26.864000 140472173479744 torch/distributed/run.py:757] |
|
W0703 03:05:26.864000 140472173479744 torch/distributed/run.py:757] ***************************************** |
|
W0703 03:05:26.864000 140472173479744 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
W0703 03:05:26.864000 140472173479744 torch/distributed/run.py:757] ***************************************** |
|
W0703 03:05:26.882000 140185577748288 torch/distributed/run.py:757] |
|
W0703 03:05:26.882000 140185577748288 torch/distributed/run.py:757] ***************************************** |
|
W0703 03:05:26.882000 140185577748288 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
W0703 03:05:26.882000 140185577748288 torch/distributed/run.py:757] ***************************************** |
|
W0703 03:05:26.887000 140357741373248 torch/distributed/run.py:757] |
|
W0703 03:05:26.887000 140357741373248 torch/distributed/run.py:757] ***************************************** |
|
W0703 03:05:26.887000 140357741373248 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
W0703 03:05:26.887000 140357741373248 torch/distributed/run.py:757] ***************************************** |
|
W0703 03:05:26.891000 140159660582720 torch/distributed/run.py:757] |
|
W0703 03:05:26.891000 140159660582720 torch/distributed/run.py:757] ***************************************** |
|
W0703 03:05:26.891000 140159660582720 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
W0703 03:05:26.891000 140159660582720 torch/distributed/run.py:757] ***************************************** |
|
W0703 03:05:26.928000 140086511552320 torch/distributed/run.py:757] |
|
W0703 03:05:26.928000 140086511552320 torch/distributed/run.py:757] ***************************************** |
|
W0703 03:05:26.928000 140086511552320 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
W0703 03:05:26.928000 140086511552320 torch/distributed/run.py:757] ***************************************** |
|
W0703 03:05:26.932000 140132832876352 torch/distributed/run.py:757] |
|
W0703 03:05:26.932000 140132832876352 torch/distributed/run.py:757] ***************************************** |
|
W0703 03:05:26.932000 140132832876352 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
W0703 03:05:26.932000 140132832876352 torch/distributed/run.py:757] ***************************************** |
|
[default0]:07/03/2024 03:05:52 [WARNING|DP=0|PP=0|TP=0|ip-26-0-161-103]: [Vocab Size Padding] Padded vocab (size: 50257) with 47 dummy tokens (new size: 50304) |
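
The padding reported above (50257 -> 50304, 47 dummy tokens) is consistent with rounding the vocab up to a multiple of the TP size. A small sketch that reproduces the numbers (illustrative only, not nanotron's exact implementation):

```python
# Round the tokenizer vocab up to a multiple of tp_size * make_vocab_size_divisible_by
# so the embedding can be sharded evenly across tensor-parallel ranks.
def padded_vocab_size(vocab_size: int, tp_size: int, divisible_by: int = 1) -> int:
    multiple = tp_size * divisible_by
    return ((vocab_size + multiple - 1) // multiple) * multiple

print(padded_vocab_size(50257, tp_size=64))  # 50304, i.e. 47 dummy tokens added
```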
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: Config: |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: Config(general=GeneralArgs(project='bench_cluster', |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: run='%date_%jobid', |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: seed=42, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: step=None, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: consumed_train_samples=None, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: benchmark_csv_path=None, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: ignore_sanity_checks=True), |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: parallelism=ParallelismArgs(dp=1, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: pp=1, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tp=64, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: pp_engine=<nanotron.parallel.pipeline_parallel.engine.OneForwardOneBackwardPipelineEngine object at 0x7fa162120700>, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tp_mode=<TensorParallelLinearMode.REDUCE_SCATTER: 2>, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tp_linear_async_communication=False, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: expert_parallel_size=1), |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: eos_token_id=2, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: hidden_act='silu', |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: hidden_size=2048, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: initializer_range=0.02, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: intermediate_size=4096, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: is_llama_config=True, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: max_position_embeddings=4096, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: num_attention_heads=32, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: num_hidden_layers=24, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: num_key_value_heads=32, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: pad_token_id=None, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: pretraining_tp=1, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: rms_norm_eps=1e-05, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: rope_scaling=None, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: rope_theta=10000.0, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tie_word_embeddings=True, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: use_cache=True, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: vocab_size=50304), |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: init_method=RandomInit(std=0.025), |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: dtype=torch.bfloat16, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: make_vocab_size_divisible_by=1, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: ddp_bucket_cap_mb=25), |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2', |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tokenizer_revision=None, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tokenizer_max_length=None), |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: checkpoints=CheckpointsArgs(checkpoints_path=Path('/dev/null'), |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: checkpoint_interval=100000, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: save_initial_state=False, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: resume_checkpoint_path=None, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: checkpoints_path_is_shared_file_system=False), |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: logging=LoggingArgs(log_level='info', |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: log_level_replica='info', |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: iteration_step_info_interval=1), |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tokens=TokensArgs(sequence_length=4096, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: train_steps=20, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: micro_batch_size=4, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: batch_accumulation_per_replica=256, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: val_check_interval=-1, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: limit_val_batches=0, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: limit_test_batches=0), |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: adam_beta1=0.9, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: adam_beta2=0.95, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: torch_adam_is_fused=True, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: name='adamW'), |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: zero_stage=1, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: weight_decay=0.01, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: clip_grad=1.0, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: accumulate_grad_in_fp32=True, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: lr_warmup_steps=1, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: lr_warmup_style='linear', |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: lr_decay_style='linear', |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: lr_decay_steps=19, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: lr_decay_starting_step=None, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: min_decay_lr=1e-05)), |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: data_stages=[DatasetStageArgs(name='Training Stage', |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: start_training_step=1, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories', |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: hf_dataset_splits='train', |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: hf_dataset_config_name=None, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: dataset_processing_num_proc_per_process=64, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: dataset_overwrite_cache=False, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: text_column_name='text'), |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: seed=42, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: num_loading_workers=0))], |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: profiler=ProfilerArgs(profiler_export_path=Path('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/64_GPUS/dp-1_tp-64_pp-1_mbz-4')), |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: lighteval=None) |
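
From the TokensArgs and ParallelismArgs logged above, the per-step token budget follows directly; plain arithmetic for reference:

```python
# Token budget per optimizer step implied by the config above (plain arithmetic).
dp = 1
micro_batch_size = 4
batch_accumulation_per_replica = 256
sequence_length = 4096

sequences_per_step = dp * micro_batch_size * batch_accumulation_per_replica  # 1024
tokens_per_step = sequences_per_step * sequence_length                        # 4_194_304
print(sequences_per_step, tokens_per_step)
```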
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: Model Config: |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: LlamaConfig(bos_token_id=1, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: eos_token_id=2, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: hidden_act='silu', |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: hidden_size=2048, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: initializer_range=0.02, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: intermediate_size=4096, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: is_llama_config=True, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: max_position_embeddings=4096, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: num_attention_heads=32, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: num_hidden_layers=24, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: num_key_value_heads=32, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: pad_token_id=None, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: pretraining_tp=1, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: rms_norm_eps=1e-05, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: rope_scaling=None, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: rope_theta=10000.0, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: tie_word_embeddings=True, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: use_cache=True, |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: vocab_size=50304) |
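
A rough parameter count for the model config above, assuming a SwiGLU-style MLP with gate/up/down projections and ignoring norm weights; the result is consistent with the "llama-1B" label in the run path:

```python
# Back-of-the-envelope parameter count for the LlamaConfig above (sketch).
hidden, inter, layers, vocab = 2048, 4096, 24, 50304

embed = vocab * hidden              # word embeddings (tied with the LM head)
attn = 4 * hidden * hidden          # q, k, v, o projections (num_kv_heads == num_heads)
mlp = 3 * hidden * inter            # gate, up and down projections (SwiGLU assumed)
total = embed + layers * (attn + mlp)

print(f"~{total / 1e9:.2f}B parameters")  # ~1.11B
```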
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: Building model.. |
|
[default0]:07/03/2024 03:05:52 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-103]: Setting PP block ranks... |
|
[default1]:[rank25]: Traceback (most recent call last): |
|
[default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default1]:[rank25]: trainer = DistributedTrainer(config_file) |
|
[default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default1]:[rank25]: self.model = self.init_model() # Defines self.model |
|
[default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default1]:[rank25]: model = self._init_model_instance() |
|
[default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default1]:[rank25]: model = self._init_model( |
|
[default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default1]:[rank25]: model = build_model( |
|
[default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default1]:[rank25]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default1]:[rank25]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default1]:[rank25]: self.attn = CausalSelfAttention( |
|
[default1]:[rank25]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default1]:[rank25]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default1]:[rank25]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
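
The assertion repeated by every rank below reduces to a simple divisibility invariant: each tensor-parallel rank must own a whole number of attention heads. A minimal illustration (not nanotron's code) of why dp-1_tp-64_pp-1 cannot be built for a 32-head model:

```python
# The invariant behind the AssertionError above: heads must split evenly across TP ranks.
num_attention_heads = 32
tp_size = 64

if num_attention_heads % tp_size != 0:
    print(f"{num_attention_heads} heads cannot be sharded across {tp_size} TP ranks")

valid_tp_sizes = [d for d in range(1, num_attention_heads + 1) if num_attention_heads % d == 0]
print("Usable TP sizes for 32 heads:", valid_tp_sizes)  # [1, 2, 4, 8, 16, 32]
```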
|
[default0]:[rank56]: Traceback (most recent call last): |
|
[default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default0]:[rank56]: trainer = DistributedTrainer(config_file) |
|
[default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default0]:[rank56]: self.model = self.init_model() # Defines self.model |
|
[default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default0]:[rank56]: model = self._init_model_instance() |
|
[default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default0]:[rank56]: model = self._init_model( |
|
[default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default0]:[rank56]: model = build_model( |
|
[default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default0]:[rank56]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default0]:[rank56]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default0]:[rank56]: self.attn = CausalSelfAttention( |
|
[default0]:[rank56]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default0]:[rank56]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default0]:[rank56]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default5]:[rank61]: Traceback (most recent call last): |
|
[default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default5]:[rank61]: trainer = DistributedTrainer(config_file) |
|
[default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default5]:[rank61]: self.model = self.init_model() # Defines self.model |
|
[default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default5]:[rank61]: model = self._init_model_instance() |
|
[default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default5]:[rank61]: model = self._init_model( |
|
[default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default5]:[rank61]: model = build_model( |
|
[default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default5]:[rank61]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default5]:[rank61]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default5]:[rank61]: self.attn = CausalSelfAttention( |
|
[default5]:[rank61]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default5]:[rank61]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default5]:[rank61]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default7]:[rank31]: Traceback (most recent call last): |
|
[default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default7]:[rank31]: trainer = DistributedTrainer(config_file) |
|
[default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default7]:[rank31]: self.model = self.init_model() # Defines self.model |
|
[default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default7]:[rank31]: model = self._init_model_instance() |
|
[default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default7]:[rank31]: model = self._init_model( |
|
[default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default7]:[rank31]: model = build_model( |
|
[default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default7]:[rank31]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default7]:[rank31]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default7]:[rank31]: self.attn = CausalSelfAttention( |
|
[default7]:[rank31]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default7]:[rank31]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default7]:[rank31]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default3]:[rank27]: Traceback (most recent call last): |
|
[default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default3]:[rank27]: trainer = DistributedTrainer(config_file) |
|
[default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default3]:[rank27]: self.model = self.init_model() # Defines self.model |
|
[default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default3]:[rank27]: model = self._init_model_instance() |
|
[default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default3]:[rank27]: model = self._init_model( |
|
[default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default4]:[rank60]: Traceback (most recent call last): |
|
[default3]:[rank27]: model = build_model( |
|
[default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default3]:[rank27]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default3]:[rank27]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default3]:[rank27]: self.attn = CausalSelfAttention( |
|
[default3]:[rank27]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default3]:[rank27]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default3]:[rank27]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default4]:[rank60]: trainer = DistributedTrainer(config_file) |
|
[default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default4]:[rank60]: self.model = self.init_model() # Defines self.model |
|
[default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default3]:[rank59]: Traceback (most recent call last): |
|
[default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default4]:[rank60]: model = self._init_model_instance() |
|
[default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default4]:[rank60]: model = self._init_model( |
|
[default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default4]:[rank60]: model = build_model( |
|
[default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default3]:[rank59]: trainer = DistributedTrainer(config_file) |
|
[default4]:[rank60]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default4]:[rank60]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default4]:[rank60]: self.attn = CausalSelfAttention( |
|
[default4]:[rank60]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default2]:[rank58]: Traceback (most recent call last): |
|
[default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default2]:[rank58]: trainer = DistributedTrainer(config_file) |
|
[default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default4]:[rank60]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default4]:[rank60]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default2]:[rank58]: self.model = self.init_model() # Defines self.model |
|
[default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default2]:[rank58]: model = self._init_model_instance() |
|
[default3]:[rank59]: self.model = self.init_model() # Defines self.model |
|
[default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default3]:[rank59]: model = self._init_model_instance() |
|
[default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default3]:[rank59]: model = self._init_model( |
|
[default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default3]:[rank59]: model = build_model( |
|
[default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default3]:[rank59]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default3]:[rank59]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default3]:[rank59]: self.attn = CausalSelfAttention( |
|
[default2]:[rank58]: model = self._init_model( |
|
[default3]:[rank59]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default3]:[rank59]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default3]:[rank59]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default2]:[rank58]: model = build_model( |
|
[default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default2]:[rank58]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default2]:[rank58]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default2]:[rank58]: self.attn = CausalSelfAttention( |
|
[default2]:[rank58]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default2]:[rank58]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default2]:[rank58]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default4]:[rank28]: Traceback (most recent call last): |
|
[default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default4]:[rank28]: trainer = DistributedTrainer(config_file) |
|
[default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default4]:[rank28]: self.model = self.init_model() # Defines self.model |
|
[default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default4]:[rank28]: model = self._init_model_instance() |
|
[default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default4]:[rank28]: model = self._init_model( |
|
[default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default4]:[rank28]: model = build_model( |
|
[default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default4]:[rank28]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default4]:[rank28]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default4]:[rank28]: self.attn = CausalSelfAttention( |
|
[default4]:[rank28]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default4]:[rank28]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default4]:[rank28]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default6]:[rank30]: Traceback (most recent call last): |
|
[default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default6]:[rank30]: trainer = DistributedTrainer(config_file) |
|
[default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default6]:[rank30]: self.model = self.init_model() # Defines self.model |
|
[default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default6]:[rank30]: model = self._init_model_instance() |
|
[default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default6]:[rank30]: model = self._init_model( |
|
[default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default6]:[rank30]: model = build_model( |
|
[default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default6]:[rank30]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default6]:[rank30]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default6]:[rank30]: self.attn = CausalSelfAttention( |
|
[default6]:[rank30]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default6]:[rank30]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default6]:[rank62]: Traceback (most recent call last): |
|
[default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default6]:[rank30]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default5]:[rank29]: Traceback (most recent call last): |
|
[default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default5]:[rank29]: trainer = DistributedTrainer(config_file) |
|
[default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default5]:[rank29]: self.model = self.init_model() # Defines self.model |
|
[default6]:[rank62]: trainer = DistributedTrainer(config_file) |
|
[default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default5]:[rank29]: model = self._init_model_instance() |
|
[default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default6]:[rank62]: self.model = self.init_model() # Defines self.model |
|
[default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default6]:[rank62]: model = self._init_model_instance() |
|
[default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default6]:[rank62]: model = self._init_model( |
|
[default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default5]:[rank29]: model = self._init_model( |
|
[default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default5]:[rank29]: model = build_model( |
|
[default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default6]:[rank62]: model = build_model( |
|
[default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default5]:[rank29]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default6]:[rank62]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default5]:[rank29]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default6]:[rank62]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default6]:[rank62]: self.attn = CausalSelfAttention( |
|
[default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default5]:[rank29]: self.attn = CausalSelfAttention( |
|
[default5]:[rank29]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default5]:[rank29]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default6]:[rank62]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default6]:[rank62]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default6]:[rank62]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default5]:[rank29]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default7]:[rank63]: Traceback (most recent call last): |
|
[default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default7]:[rank63]: trainer = DistributedTrainer(config_file) |
|
[default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default7]:[rank63]: self.model = self.init_model() # Defines self.model |
|
[default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default7]:[rank63]: model = self._init_model_instance() |
|
[default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default7]:[rank63]: model = self._init_model( |
|
[default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default7]:[rank63]: model = build_model( |
|
[default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default7]:[rank63]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default7]:[rank63]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default7]:[rank63]: self.attn = CausalSelfAttention( |
|
[default7]:[rank63]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default7]:[rank63]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default7]:[rank63]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default0]:[rank24]: Traceback (most recent call last): |
|
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default0]:[rank24]: trainer = DistributedTrainer(config_file) |
|
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default0]:[rank24]: self.model = self.init_model() # Defines self.model |
|
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default0]:[rank24]: model = self._init_model_instance() |
|
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default0]:[rank24]: model = self._init_model( |
|
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default0]:[rank24]: model = build_model( |
|
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default0]:[rank24]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default0]:[rank24]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default0]:[rank24]: self.attn = CausalSelfAttention( |
|
[default0]:[rank24]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default0]:[rank24]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default0]:[rank24]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default2]:[rank26]: Traceback (most recent call last): |
|
[default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default2]:[rank26]: trainer = DistributedTrainer(config_file) |
|
[default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default2]:[rank26]: self.model = self.init_model() # Defines self.model |
|
[default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default2]:[rank26]: model = self._init_model_instance() |
|
[default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default2]:[rank26]: model = self._init_model( |
|
[default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default2]:[rank26]: model = build_model( |
|
[default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default2]:[rank26]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default2]:[rank26]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default2]:[rank26]: self.attn = CausalSelfAttention( |
|
[default2]:[rank26]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default2]:[rank26]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default2]:[rank26]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default2]:[rank18]: Traceback (most recent call last): |
|
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default2]:[rank18]: trainer = DistributedTrainer(config_file) |
|
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default2]:[rank18]: self.model = self.init_model() # Defines self.model |
|
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default2]:[rank18]: model = self._init_model_instance() |
|
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default2]:[rank18]: model = self._init_model( |
|
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default2]:[rank18]: model = build_model( |
|
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default2]:[rank18]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default2]:[rank18]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default2]:[rank18]: self.attn = CausalSelfAttention( |
|
[default2]:[rank18]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default2]:[rank18]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default2]:[rank18]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default2]:[rank10]: Traceback (most recent call last): |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default2]:[rank10]: trainer = DistributedTrainer(config_file) |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default2]:[rank10]: self.model = self.init_model() # Defines self.model |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default2]:[rank10]: model = self._init_model_instance() |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default2]:[rank10]: model = self._init_model( |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default2]:[rank10]: model = build_model( |
|
[default0]:[rank48]: Traceback (most recent call last): |
|
[default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default0]:[rank48]: trainer = DistributedTrainer(config_file) |
|
[default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default0]:[rank48]: self.model = self.init_model() # Defines self.model |
|
[default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default0]:[rank48]: model = self._init_model_instance() |
|
[default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default0]:[rank48]: model = self._init_model( |
|
[default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default0]:[rank48]: model = build_model( |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default2]:[rank10]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default2]:[rank10]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default0]:[rank16]: Traceback (most recent call last): |
|
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default0]:[rank16]: trainer = DistributedTrainer(config_file) |
|
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default0]:[rank16]: self.model = self.init_model() # Defines self.model |
|
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default0]:[rank16]: model = self._init_model_instance() |
|
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default0]:[rank16]: model = self._init_model( |
|
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default0]:[rank16]: model = build_model( |
|
[default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default0]:[rank48]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default0]:[rank48]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default0]:[rank48]: self.attn = CausalSelfAttention( |
|
[default0]:[rank48]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default2]:[rank10]: self.attn = CausalSelfAttention( |
|
[default2]:[rank10]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default2]:[rank10]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default2]:[rank10]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default0]:[rank48]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default0]:[rank48]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default0]:[rank16]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default0]:[rank16]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default0]:[rank16]: self.attn = CausalSelfAttention( |
|
[default0]:[rank16]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default0]:[rank16]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default0]:[rank16]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default1]:[rank57]: Traceback (most recent call last): |
|
[default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default1]:[rank57]: trainer = DistributedTrainer(config_file) |
|
[default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default1]:[rank57]: self.model = self.init_model() # Defines self.model |
|
[default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default1]:[rank57]: model = self._init_model_instance() |
|
[default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default1]:[rank57]: model = self._init_model( |
|
[default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default1]:[rank57]: model = build_model( |
|
[default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default1]:[rank57]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default1]:[rank57]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default1]:[rank57]: self.attn = CausalSelfAttention( |
|
[default1]:[rank57]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default1]:[rank57]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default1]:[rank57]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default7]:[rank39]: Traceback (most recent call last): |
|
[default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default7]:[rank39]: trainer = DistributedTrainer(config_file) |
|
[default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default7]:[rank39]: self.model = self.init_model() # Defines self.model |
|
[default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default7]:[rank39]: model = self._init_model_instance() |
|
[default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default7]:[rank39]: model = self._init_model( |
|
[default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default7]:[rank39]: model = build_model( |
|
[default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default7]:[rank39]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default7]:[rank39]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default7]:[rank39]: self.attn = CausalSelfAttention( |
|
[default7]:[rank39]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default7]:[rank39]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default7]:[rank39]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default7]:[rank47]: Traceback (most recent call last): |
|
[default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default7]:[rank47]: trainer = DistributedTrainer(config_file) |
|
[default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default7]:[rank47]: self.model = self.init_model() # Defines self.model |
|
[default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default7]:[rank47]: model = self._init_model_instance() |
|
[default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default7]:[rank47]: model = self._init_model( |
|
[default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default7]:[rank47]: model = build_model( |
|
[default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default7]:[rank47]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default7]:[rank47]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default7]:[rank47]: self.attn = CausalSelfAttention( |
|
[default7]:[rank47]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default7]:[rank47]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default7]:[rank47]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default0]:[rank40]: Traceback (most recent call last): |
|
[default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default0]:[rank40]: trainer = DistributedTrainer(config_file) |
|
[default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default0]:[rank40]: self.model = self.init_model() # Defines self.model |
|
[default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default0]:[rank40]: model = self._init_model_instance() |
|
[default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default0]:[rank40]: model = self._init_model( |
|
[default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default0]:[rank40]: model = build_model( |
|
[default0]:[rank0]: Traceback (most recent call last): |
|
[default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default0]:[rank40]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default0]:[rank40]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default0]:[rank40]: self.attn = CausalSelfAttention( |
|
[default0]:[rank40]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default0]:[rank40]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default0]:[rank40]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default6]:[rank46]: Traceback (most recent call last): |
|
[default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default6]:[rank46]: trainer = DistributedTrainer(config_file) |
|
[default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default6]:[rank46]: self.model = self.init_model() # Defines self.model |
|
[default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default6]:[rank46]: model = self._init_model_instance() |
|
[default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default6]:[rank46]: model = self._init_model( |
|
[default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default6]:[rank46]: model = build_model( |
|
[default0]:[rank0]: trainer = DistributedTrainer(config_file) |
|
[default5]:[rank5]: Traceback (most recent call last): |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default6]:[rank46]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default6]:[rank46]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default5]:[rank5]: trainer = DistributedTrainer(config_file) |
|
[default1]:[rank1]: Traceback (most recent call last): |
|
[default4]:[rank4]: Traceback (most recent call last): |
|
[default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default6]:[rank46]: self.attn = CausalSelfAttention( |
|
[default6]:[rank46]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default6]:[rank46]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default6]:[rank46]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default4]:[rank4]: trainer = DistributedTrainer(config_file) |
|
[default4]:[rank44]: Traceback (most recent call last): |
|
[default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default6]:[rank38]: Traceback (most recent call last): |
|
[default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default6]:[rank38]: trainer = DistributedTrainer(config_file) |
|
[default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default6]:[rank38]: self.model = self.init_model() # Defines self.model |
|
[default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default6]:[rank38]: model = self._init_model_instance() |
|
[default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default6]:[rank38]: model = self._init_model( |
|
[default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default6]:[rank38]: model = build_model( |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default4]:[rank44]: trainer = DistributedTrainer(config_file) |
|
[default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default4]:[rank44]: self.model = self.init_model() # Defines self.model |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default4]:[rank44]: model = self._init_model_instance() |
|
[default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default4]:[rank44]: model = self._init_model( |
|
[default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default4]:[rank44]: model = build_model( |
|
[default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default3]:[rank19]: Traceback (most recent call last): |
|
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default3]:[rank19]: trainer = DistributedTrainer(config_file) |
|
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default3]:[rank19]: self.model = self.init_model() # Defines self.model |
|
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default3]:[rank19]: model = self._init_model_instance() |
|
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default3]:[rank19]: model = self._init_model( |
|
[default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default6]:[rank38]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default6]:[rank38]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default4]:[rank44]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default4]:[rank44]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default6]:[rank38]: self.attn = CausalSelfAttention( |
|
[default6]:[rank38]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default6]:[rank38]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default6]:[rank38]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default5]:[rank5]: self.model = self.init_model() # Defines self.model |
|
[default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default4]:[rank44]: self.attn = CausalSelfAttention( |
|
[default4]:[rank44]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default4]:[rank44]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default4]:[rank44]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default1]:[rank1]: trainer = DistributedTrainer(config_file) |
|
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default3]:[rank19]: model = build_model( |
|
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default3]:[rank19]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default3]:[rank19]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default3]:[rank19]: self.attn = CausalSelfAttention( |
|
[default0]:[rank0]: self.model = self.init_model() # Defines self.model |
|
[default3]:[rank19]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default3]:[rank19]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default3]:[rank19]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default7]:[rank7]: Traceback (most recent call last): |
|
[default7]:[rank55]: Traceback (most recent call last): |
|
[default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default7]:[rank55]: trainer = DistributedTrainer(config_file) |
|
[default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default7]:[rank55]: self.model = self.init_model() # Defines self.model |
|
[default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default7]:[rank55]: model = self._init_model_instance() |
|
[default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default7]:[rank55]: model = self._init_model( |
|
[default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default7]:[rank55]: model = build_model( |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default1]:[rank17]: Traceback (most recent call last): |
|
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default1]:[rank17]: trainer = DistributedTrainer(config_file) |
|
[default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default1]:[rank17]: self.model = self.init_model() # Defines self.model |
|
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default1]:[rank17]: model = self._init_model_instance() |
|
[default7]:[rank55]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default7]:[rank55]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default7]:[rank55]: self.attn = CausalSelfAttention( |
|
[default7]:[rank55]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default7]:[rank55]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default7]:[rank55]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default1]:[rank17]: model = self._init_model( |
|
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default1]:[rank17]: model = build_model( |
|
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default1]:[rank1]: self.model = self.init_model() # Defines self.model |
|
[default1]:[rank17]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default0]:[rank0]: model = self._init_model_instance() |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default1]:[rank17]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default1]:[rank17]: self.attn = CausalSelfAttention( |
|
[default1]:[rank17]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default1]:[rank17]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default1]:[rank17]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default7]:[rank7]: trainer = DistributedTrainer(config_file) |
|
[default5]:[rank45]: Traceback (most recent call last): |
|
[default4]:[rank4]: self.model = self.init_model() # Defines self.model |
|
[default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default1]:[rank41]: Traceback (most recent call last): |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default5]:[rank45]: trainer = DistributedTrainer(config_file) |
|
[default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default1]:[rank41]: trainer = DistributedTrainer(config_file) |
|
[default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default5]:[rank45]: self.model = self.init_model() # Defines self.model |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default4]:[rank4]: model = self._init_model_instance() |
|
[default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default7]:[rank7]: self.model = self.init_model() # Defines self.model |
|
[default1]:[rank1]: model = self._init_model_instance() |
|
[default5]:[rank45]: model = self._init_model_instance() |
|
[default5]:[rank5]: model = self._init_model_instance() |
|
[default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default7]:[rank23]: Traceback (most recent call last): |
|
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default7]:[rank23]: trainer = DistributedTrainer(config_file) |
|
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default7]:[rank23]: self.model = self.init_model() # Defines self.model |
|
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default7]:[rank23]: model = self._init_model_instance() |
|
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default7]:[rank23]: model = self._init_model( |
|
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default7]:[rank23]: model = build_model( |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default5]:[rank45]: model = self._init_model( |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default0]:[rank0]: model = self._init_model( |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default1]:[rank41]: self.model = self.init_model() # Defines self.model |
|
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default7]:[rank23]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default7]:[rank23]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default7]:[rank23]: self.attn = CausalSelfAttention( |
|
[default7]:[rank23]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default7]:[rank23]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default7]:[rank23]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default4]:[rank4]: model = self._init_model( |
|
[default1]:[rank41]: model = self._init_model_instance() |
|
[default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default1]:[rank41]: model = self._init_model( |
|
[default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default1]:[rank41]: model = build_model( |
|
[default7]:[rank7]: model = self._init_model_instance() |
|
[default5]:[rank5]: model = self._init_model( |
|
[default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default1]:[rank41]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default0]:[rank0]: model = build_model( |
|
[default1]:[rank1]: model = self._init_model( |
|
[default1]:[rank41]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default1]:[rank41]: self.attn = CausalSelfAttention( |
|
[default6]:[rank22]: Traceback (most recent call last): |
|
[default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default6]:[rank22]: trainer = DistributedTrainer(config_file) |
|
[default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default6]:[rank22]: self.model = self.init_model() # Defines self.model |
|
[default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default6]:[rank22]: model = self._init_model_instance() |
|
[default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default6]:[rank22]: model = self._init_model( |
|
[default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default6]:[rank22]: model = build_model( |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default5]:[rank5]: model = build_model( |
|
[default1]:[rank41]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default1]:[rank41]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default4]:[rank4]: model = build_model( |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default1]:[rank41]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default6]:[rank22]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default2]:[rank34]: Traceback (most recent call last): |
|
[default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default2]:[rank34]: trainer = DistributedTrainer(config_file) |
|
[default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default2]:[rank34]: self.model = self.init_model() # Defines self.model |
|
[default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default2]:[rank34]: model = self._init_model_instance() |
|
[default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default2]:[rank34]: model = self._init_model( |
|
[default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default7]:[rank7]: model = self._init_model( |
|
[default5]:[rank45]: model = build_model( |
|
[default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default6]:[rank22]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default6]:[rank22]: self.attn = CausalSelfAttention( |
|
[default6]:[rank22]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default6]:[rank22]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default6]:[rank22]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default2]:[rank34]: model = build_model( |
|
[default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default2]:[rank34]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default2]:[rank34]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default2]:[rank34]: self.attn = CausalSelfAttention( |
|
[default2]:[rank34]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default2]:[rank34]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default2]:[rank34]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default1]:[rank1]: model = build_model( |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default5]:[rank45]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default0]:[rank0]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default5]:[rank45]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default5]:[rank45]: self.attn = CausalSelfAttention( |
|
[default1]:[rank1]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default5]:[rank45]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default5]:[rank45]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default5]:[rank45]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default6]:[rank54]: Traceback (most recent call last): |
|
[default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default6]:[rank54]: trainer = DistributedTrainer(config_file) |
|
[default5]:[rank5]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default4]:[rank4]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default6]:[rank54]: self.model = self.init_model() # Defines self.model |
|
[default7]:[rank7]: model = build_model( |
|
[default3]:[rank43]: Traceback (most recent call last): |
|
[default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default3]:[rank43]: trainer = DistributedTrainer(config_file) |
|
[default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default3]:[rank43]: self.model = self.init_model() # Defines self.model |
|
[default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default3]:[rank43]: model = self._init_model_instance() |
|
[default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default3]:[rank43]: model = self._init_model( |
|
[default5]:[rank21]: Traceback (most recent call last): |
|
[default4]:[rank20]: Traceback (most recent call last): |
|
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default6]:[rank54]: model = self._init_model_instance() |
|
[default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default6]:[rank54]: model = self._init_model( |
|
[default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default6]:[rank54]: model = build_model( |
|
[default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default6]:[rank54]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default6]:[rank54]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default6]:[rank54]: self.attn = CausalSelfAttention( |
|
[default6]:[rank54]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default6]:[rank54]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default6]:[rank54]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default4]:[rank52]: Traceback (most recent call last): |
|
[default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default4]:[rank52]: trainer = DistributedTrainer(config_file) |
|
[default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default4]:[rank52]: self.model = self.init_model() # Defines self.model |
|
[default4]:[rank20]: trainer = DistributedTrainer(config_file) |
|
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default4]:[rank52]: model = self._init_model_instance() |
|
[default4]:[rank20]: self.model = self.init_model() # Defines self.model |
|
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default4]:[rank52]: model = self._init_model( |
|
[default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default4]:[rank52]: model = build_model( |
|
[default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default5]:[rank5]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default3]:[rank43]: model = build_model( |
|
[default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default3]:[rank43]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default3]:[rank43]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default4]:[rank20]: model = self._init_model_instance() |
|
[default5]:[rank21]: trainer = DistributedTrainer(config_file) |
|
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default4]:[rank20]: model = self._init_model( |
|
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default4]:[rank20]: model = build_model( |
|
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default4]:[rank52]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default3]:[rank43]: self.attn = CausalSelfAttention( |
|
[default3]:[rank43]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default3]:[rank43]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default3]:[rank43]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default4]:[rank20]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default4]:[rank52]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default0]:[rank0]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default5]:[rank21]: self.model = self.init_model() # Defines self.model |
|
[default4]:[rank52]: self.attn = CausalSelfAttention( |
|
[default4]:[rank52]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default4]:[rank52]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default4]:[rank52]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default2]:[rank2]: Traceback (most recent call last): |
|
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default5]:[rank21]: model = self._init_model_instance() |
|
[default1]:[rank1]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default5]:[rank5]: self.attn = CausalSelfAttention( |
|
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default5]:[rank21]: model = self._init_model( |
|
[default5]:[rank5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default4]:[rank20]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default4]:[rank20]: self.attn = CausalSelfAttention( |
|
[default4]:[rank20]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default4]:[rank20]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default4]:[rank20]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default5]:[rank21]: model = build_model( |
|
[default0]:[rank0]: self.attn = CausalSelfAttention( |
|
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default5]:[rank21]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default3]:[rank3]: Traceback (most recent call last): |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default5]:[rank5]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default5]:[rank21]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default5]:[rank5]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default1]:[rank1]: self.attn = CausalSelfAttention( |
|
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default5]:[rank21]: self.attn = CausalSelfAttention( |
|
[default5]:[rank21]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default5]:[rank21]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default5]:[rank21]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default1]:[rank1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default2]:[rank2]: trainer = DistributedTrainer(config_file) |
|
[default4]:[rank4]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default0]:[rank0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default0]:[rank0]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default7]:[rank7]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default1]:[rank1]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default1]:[rank1]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default3]:[rank3]: trainer = DistributedTrainer(config_file) |
|
[default4]:[rank4]: self.attn = CausalSelfAttention( |
|
[default1]:[rank9]: Traceback (most recent call last): |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default1]:[rank9]: trainer = DistributedTrainer(config_file) |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default1]:[rank9]: self.model = self.init_model() # Defines self.model |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default1]:[rank9]: model = self._init_model_instance() |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default1]:[rank9]: model = self._init_model( |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default1]:[rank9]: model = build_model( |
|
[default0]:[rank0]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default2]:[rank2]: self.model = self.init_model() # Defines self.model |
|
[default7]:[rank7]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default1]:[rank9]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default1]:[rank9]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default1]:[rank9]: self.attn = CausalSelfAttention( |
|
[default1]:[rank9]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default1]:[rank9]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default1]:[rank9]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default0]:[rank32]: Traceback (most recent call last): |
|
[default7]:[rank7]: self.attn = CausalSelfAttention( |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default2]:[rank2]: model = self._init_model_instance() |
|
[default4]:[rank4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default7]:[rank15]: Traceback (most recent call last): |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default7]:[rank15]: trainer = DistributedTrainer(config_file) |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default7]:[rank15]: self.model = self.init_model() # Defines self.model |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default7]:[rank15]: model = self._init_model_instance() |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default7]:[rank15]: model = self._init_model( |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default7]:[rank15]: model = build_model( |
|
[default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default3]:[rank35]: Traceback (most recent call last): |
|
[default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default4]:[rank4]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default0]:[rank32]: trainer = DistributedTrainer(config_file) |
|
[default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default0]:[rank32]: self.model = self.init_model() # Defines self.model |
|
[default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default0]:[rank32]: model = self._init_model_instance() |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default3]:[rank35]: trainer = DistributedTrainer(config_file) |
|
[default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default2]:[rank2]: model = self._init_model( |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default7]:[rank15]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default7]:[rank15]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default7]:[rank15]: self.attn = CausalSelfAttention( |
|
[default7]:[rank15]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default7]:[rank15]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default7]:[rank15]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default3]:[rank35]: self.model = self.init_model() # Defines self.model |
|
[default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default3]:[rank35]: model = self._init_model_instance() |
|
[default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default3]:[rank3]: self.model = self.init_model() # Defines self.model |
|
[default4]:[rank4]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default3]:[rank35]: model = self._init_model( |
|
[default2]:[rank2]: model = build_model( |
|
[default7]:[rank7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default7]:[rank7]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default3]:[rank35]: model = build_model( |
|
[default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default3]:[rank35]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default6]:[rank14]: Traceback (most recent call last): |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default3]:[rank35]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default3]:[rank35]: self.attn = CausalSelfAttention( |
|
[default3]:[rank35]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default3]:[rank35]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default3]:[rank35]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default0]:[rank32]: model = self._init_model( |
|
[default6]:[rank14]: trainer = DistributedTrainer(config_file) |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default6]:[rank14]: self.model = self.init_model() # Defines self.model |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default6]:[rank14]: model = self._init_model_instance() |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default0]:[rank32]: model = build_model( |
|
[default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default0]:[rank32]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default0]:[rank32]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default6]:[rank14]: model = self._init_model( |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default0]:[rank8]: Traceback (most recent call last): |
|
[default5]:[rank37]: Traceback (most recent call last): |
|
[default5]:[rank13]: Traceback (most recent call last): |
|
[default6]:[rank14]: model = build_model( |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default6]:[rank14]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default0]:[rank8]: trainer = DistributedTrainer(config_file) |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default0]:[rank32]: self.attn = CausalSelfAttention( |
|
[default0]:[rank32]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default4]:[rank36]: Traceback (most recent call last): |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default0]:[rank8]: self.model = self.init_model() # Defines self.model |
|
[default5]:[rank37]: trainer = DistributedTrainer(config_file) |
|
[default4]:[rank36]: trainer = DistributedTrainer(config_file) |
|
[default0]:[rank32]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default0]:[rank32]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default6]:[rank14]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default5]:[rank13]: trainer = DistributedTrainer(config_file) |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default5]:[rank37]: self.model = self.init_model() # Defines self.model |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default5]:[rank37]: model = self._init_model_instance() |
|
[default0]:[rank8]: model = self._init_model_instance() |
|
[default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default5]:[rank13]: self.model = self.init_model() # Defines self.model |
|
[default4]:[rank36]: self.model = self.init_model() # Defines self.model |
|
[default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default4]:[rank36]: model = self._init_model_instance() |
|
[default6]:[rank14]: self.attn = CausalSelfAttention( |
|
[default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default4]:[rank36]: model = self._init_model( |
|
[default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default4]:[rank36]: model = build_model( |
|
[default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default4]:[rank36]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default5]:[rank37]: model = self._init_model( |
|
[default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default6]:[rank14]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default5]:[rank37]: model = build_model( |
|
[default0]:[rank8]: model = self._init_model( |
|
[default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default5]:[rank13]: model = self._init_model_instance() |
|
[default5]:[rank37]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default6]:[rank14]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default5]:[rank13]: model = self._init_model( |
|
[default5]:[rank37]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default0]:[rank8]: model = build_model( |
|
[default4]:[rank36]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default5]:[rank37]: self.attn = CausalSelfAttention( |
|
[default5]:[rank37]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default6]:[rank14]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default5]:[rank13]: model = build_model( |
|
[default0]:[rank8]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default0]:[rank8]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default5]:[rank13]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default0]:[rank8]: self.attn = CausalSelfAttention( |
|
[default0]:[rank8]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default0]:[rank8]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default0]:[rank8]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default5]:[rank13]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default5]:[rank13]: self.attn = CausalSelfAttention( |
|
[default5]:[rank13]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default5]:[rank13]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default5]:[rank13]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default1]:[rank49]: Traceback (most recent call last): |
|
[default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default1]:[rank49]: trainer = DistributedTrainer(config_file) |
|
[default2]:[rank50]: Traceback (most recent call last): |
|
[default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default2]:[rank50]: trainer = DistributedTrainer(config_file) |
|
[default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default1]:[rank49]: self.model = self.init_model() # Defines self.model |
|
[default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default2]:[rank50]: self.model = self.init_model() # Defines self.model |
|
[default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default1]:[rank49]: model = self._init_model_instance() |
|
[default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default1]:[rank49]: model = self._init_model( |
|
[default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default1]:[rank49]: model = build_model( |
|
[default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default2]:[rank50]: model = self._init_model_instance() |
|
[default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default2]:[rank50]: model = self._init_model( |
|
[default1]:[rank49]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default1]:[rank49]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default1]:[rank49]: self.attn = CausalSelfAttention( |
|
[default1]:[rank49]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default1]:[rank49]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default1]:[rank49]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default2]:[rank50]: model = build_model( |
|
[default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default2]:[rank50]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default2]:[rank50]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default2]:[rank50]: self.attn = CausalSelfAttention( |
|
[default2]:[rank50]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default2]:[rank50]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default2]:[rank50]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default5]:[rank53]: Traceback (most recent call last): |
|
[default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default5]:[rank53]: trainer = DistributedTrainer(config_file) |
|
[default3]:[rank51]: Traceback (most recent call last): |
|
[default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default5]:[rank53]: self.model = self.init_model() # Defines self.model |
|
[default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default3]:[rank51]: trainer = DistributedTrainer(config_file) |
|
[default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default3]:[rank51]: self.model = self.init_model() # Defines self.model |
|
[default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default3]:[rank51]: model = self._init_model_instance() |
|
[default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default3]:[rank51]: model = self._init_model( |
|
[default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default5]:[rank53]: model = self._init_model_instance() |
|
[default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default5]:[rank53]: model = self._init_model( |
|
[default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default5]:[rank53]: model = build_model( |
|
[default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default5]:[rank53]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default5]:[rank53]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default3]:[rank51]: model = build_model( |
|
[default5]:[rank53]: self.attn = CausalSelfAttention( |
|
[default5]:[rank53]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default5]:[rank53]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default3]:[rank51]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default5]:[rank53]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default3]:[rank51]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default3]:[rank51]: self.attn = CausalSelfAttention( |
|
[default3]:[rank51]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default3]:[rank51]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default3]:[rank51]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default3]:[rank11]: Traceback (most recent call last): |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default3]:[rank11]: trainer = DistributedTrainer(config_file) |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default4]:[rank12]: Traceback (most recent call last): |
|
[default4]:[rank36]: self.attn = CausalSelfAttention( |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default3]:[rank11]: self.model = self.init_model() # Defines self.model |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default3]:[rank11]: model = self._init_model_instance() |
|
[default4]:[rank36]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default4]:[rank36]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default2]:[rank2]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default7]:[rank7]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default3]:[rank3]: model = self._init_model_instance() |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default4]:[rank36]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default5]:[rank37]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default5]:[rank37]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default3]:[rank11]: model = self._init_model( |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default1]:[rank33]: Traceback (most recent call last): |
|
[default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default1]:[rank33]: trainer = DistributedTrainer(config_file) |
|
[default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default1]:[rank33]: self.model = self.init_model() # Defines self.model |
|
[default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default1]:[rank33]: model = self._init_model_instance() |
|
[default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default1]:[rank33]: model = self._init_model( |
|
[default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default1]:[rank33]: model = build_model( |
|
[default3]:[rank3]: model = self._init_model( |
|
[default4]:[rank12]: trainer = DistributedTrainer(config_file) |
|
[default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default1]:[rank33]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default3]:[rank11]: model = build_model( |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default1]:[rank33]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default1]:[rank33]: self.attn = CausalSelfAttention( |
|
[default1]:[rank33]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default1]:[rank33]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default1]:[rank33]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default2]:[rank2]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default3]:[rank11]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default3]:[rank11]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default2]:[rank2]: self.attn = CausalSelfAttention( |
|
[default2]:[rank2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default2]:[rank2]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default4]:[rank12]: self.model = self.init_model() # Defines self.model |
|
[default2]:[rank2]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default3]:[rank11]: self.attn = CausalSelfAttention( |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default3]:[rank11]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default3]:[rank3]: model = build_model( |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default3]:[rank3]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default4]:[rank12]: model = self._init_model_instance() |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default4]:[rank12]: model = self._init_model( |
|
[default3]:[rank3]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default3]:[rank3]: self.attn = CausalSelfAttention( |
|
[default3]:[rank3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default4]:[rank12]: model = build_model( |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default3]:[rank3]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default3]:[rank3]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default3]:[rank11]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default6]:[rank6]: Traceback (most recent call last): |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default4]:[rank12]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default6]:[rank6]: trainer = DistributedTrainer(config_file) |
|
[default3]:[rank11]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default6]:[rank6]: self.model = self.init_model() # Defines self.model |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default4]:[rank12]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default4]:[rank12]: self.attn = CausalSelfAttention( |
|
[default4]:[rank12]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default6]:[rank6]: model = self._init_model_instance() |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default6]:[rank6]: model = self._init_model( |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default4]:[rank12]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default4]:[rank12]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default6]:[rank6]: model = build_model( |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default6]:[rank6]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default6]:[rank6]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default6]:[rank6]: self.attn = CausalSelfAttention( |
|
[default6]:[rank6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default6]:[rank6]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default6]:[rank6]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
[default2]:[rank42]: Traceback (most recent call last): |
|
[default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default2]:[rank42]: trainer = DistributedTrainer(config_file) |
|
[default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 172, in __init__ |
|
[default2]:[rank42]: self.model = self.init_model() # Defines self.model |
|
[default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 672, in init_model |
|
[default2]:[rank42]: model = self._init_model_instance() |
|
[default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 682, in _init_model_instance |
|
[default2]:[rank42]: model = self._init_model( |
|
[default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 751, in _init_model |
|
[default2]:[rank42]: model = build_model( |
|
[default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/base.py", line 230, in build_model |
|
[default2]:[rank42]: block.build_and_set_rank(target_pp_ranks[target_pp_rank_idx]) |
|
[default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 52, in build_and_set_rank |
|
[default2]:[rank42]: self.pp_block = self.module_builder(**self.module_kwargs) |
|
[default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 613, in __init__ |
|
[default2]:[rank42]: self.attn = CausalSelfAttention( |
|
[default2]:[rank42]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 271, in __init__ |
|
[default2]:[rank42]: config.num_attention_heads % tp_pg.size() == 0 |
|
[default2]:[rank42]: AssertionError: Number of attention heads (32) must be divisible by TP size (64). |
|
E0703 03:05:58.255000 140132832876352 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 884735) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 |
|
Traceback (most recent call last): |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module> |
|
sys.exit(main()) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper |
|
return f(*args, **kwargs) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main |
|
run(args) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run |
|
elastic_launch( |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
|
return launch_agent(self._config, self._entrypoint, list(args)) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent |
|
raise ChildFailedError( |
|
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
============================================================ |
|
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED |
|
------------------------------------------------------------ |
|
Failures: |
|
[1]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-103.ec2.internal |
|
rank : 1 (local_rank: 1) |
|
exitcode : 1 (pid: 884736) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[2]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-103.ec2.internal |
|
rank : 2 (local_rank: 2) |
|
exitcode : 1 (pid: 884737) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[3]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-103.ec2.internal |
|
rank : 3 (local_rank: 3) |
|
exitcode : 1 (pid: 884738) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[4]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-103.ec2.internal |
|
rank : 4 (local_rank: 4) |
|
exitcode : 1 (pid: 884739) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[5]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-103.ec2.internal |
|
rank : 5 (local_rank: 5) |
|
exitcode : 1 (pid: 884740) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[6]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-103.ec2.internal |
|
rank : 6 (local_rank: 6) |
|
exitcode : 1 (pid: 884741) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[7]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-103.ec2.internal |
|
rank : 7 (local_rank: 7) |
|
exitcode : 1 (pid: 884742) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
------------------------------------------------------------ |
|
Root Cause (first observed failure): |
|
[0]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-103.ec2.internal |
|
rank : 0 (local_rank: 0) |
|
exitcode : 1 (pid: 884735) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
============================================================ |
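
For context on the TP size of 64 in the assertions above: the failure reports in this log cover global ranks 0 through 63 across eight hosts with eight local ranks each, i.e. a world size of 64. Assuming the run used dp=1 and pp=1 (the log does not state the parallelism split explicitly), the entire world becomes the tensor-parallel group, which is why a TP size of 64 collides with the 32 attention heads. A hedged back-of-the-envelope check:

nodes, gpus_per_node = 8, 8          # eight ip-26-0-*.ec2.internal hosts, local_rank 0-7 each
world_size = nodes * gpus_per_node   # 64, matching global ranks 0..63 in the reports
num_attention_heads = 32
# A workable TP size must divide both the head count and the world size.
usable_tp = [t for t in (1, 2, 4, 8, 16, 32, 64)
             if num_attention_heads % t == 0 and world_size % t == 0]
print(usable_tp)                     # [1, 2, 4, 8, 16, 32]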
|
E0703 03:05:58.351000 140159660582720 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 680884) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 |
|
E0703 03:05:58.352000 140472173479744 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 19203) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 |
|
E0703 03:05:58.353000 140086511552320 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 3780535) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 |
|
E0703 03:05:58.354000 140185577748288 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 898300) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 |
|
E0703 03:05:58.354000 140057300637504 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1436450) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 |
|
E0703 03:05:58.354000 140551663458112 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1160130) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 |
|
E0703 03:05:58.358000 140357741373248 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 3909934) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 |
|
Traceback (most recent call last): |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module> |
|
sys.exit(main()) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper |
|
Traceback (most recent call last): |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module> |
|
return f(*args, **kwargs) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main |
|
Traceback (most recent call last): |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module> |
|
sys.exit(main()) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper |
|
run(args) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run |
|
sys.exit(main()) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper |
|
return f(*args, **kwargs) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main |
|
Traceback (most recent call last): |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module> |
|
elastic_launch( |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
|
return f(*args, **kwargs) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main |
|
run(args) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run |
|
sys.exit(main()) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper |
|
return launch_agent(self._config, self._entrypoint, list(args)) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent |
|
run(args) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run |
|
elastic_launch( |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
|
return f(*args, **kwargs) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main |
|
raise ChildFailedError( |
|
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
============================================================ |
|
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED |
|
------------------------------------------------------------ |
|
Failures: |
|
[1]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-153.ec2.internal |
|
rank : 17 (local_rank: 1) |
|
exitcode : 1 (pid: 1436451) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[2]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-153.ec2.internal |
|
rank : 18 (local_rank: 2) |
|
exitcode : 1 (pid: 1436452) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[3]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-153.ec2.internal |
|
rank : 19 (local_rank: 3) |
|
exitcode : 1 (pid: 1436453) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[4]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-153.ec2.internal |
|
rank : 20 (local_rank: 4) |
|
exitcode : 1 (pid: 1436454) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[5]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-153.ec2.internal |
|
rank : 21 (local_rank: 5) |
|
exitcode : 1 (pid: 1436455) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[6]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-153.ec2.internal |
|
rank : 22 (local_rank: 6) |
|
exitcode : 1 (pid: 1436456) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[7]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-153.ec2.internal |
|
rank : 23 (local_rank: 7) |
|
exitcode : 1 (pid: 1436457) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
------------------------------------------------------------ |
|
Root Cause (first observed failure): |
|
[0]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-153.ec2.internal |
|
rank : 16 (local_rank: 0) |
|
exitcode : 1 (pid: 1436450) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
============================================================ |
|
return launch_agent(self._config, self._entrypoint, list(args)) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent |
|
elastic_launch( |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
|
run(args) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run |
|
raise ChildFailedError( |
|
return launch_agent(self._config, self._entrypoint, list(args)) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent |
|
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
============================================================ |
|
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED |
|
------------------------------------------------------------ |
|
Failures: |
|
[1]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-102.ec2.internal |
|
rank : 41 (local_rank: 1) |
|
exitcode : 1 (pid: 3780536) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[2]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-102.ec2.internal |
|
rank : 42 (local_rank: 2) |
|
exitcode : 1 (pid: 3780537) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[3]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-102.ec2.internal |
|
rank : 43 (local_rank: 3) |
|
exitcode : 1 (pid: 3780538) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[4]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-102.ec2.internal |
|
rank : 44 (local_rank: 4) |
|
exitcode : 1 (pid: 3780539) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[5]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-102.ec2.internal |
|
rank : 45 (local_rank: 5) |
|
exitcode : 1 (pid: 3780540) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[6]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-102.ec2.internal |
|
rank : 46 (local_rank: 6) |
|
exitcode : 1 (pid: 3780541) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[7]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-102.ec2.internal |
|
rank : 47 (local_rank: 7) |
|
exitcode : 1 (pid: 3780542) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
------------------------------------------------------------ |
|
Root Cause (first observed failure): |
|
[0]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-102.ec2.internal |
|
rank : 40 (local_rank: 0) |
|
exitcode : 1 (pid: 3780535) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
============================================================ |
|
elastic_launch( |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
|
raise ChildFailedError( |
|
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
============================================================ |
|
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED |
|
------------------------------------------------------------ |
|
Failures: |
|
[1]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-166-125.ec2.internal |
|
rank : 33 (local_rank: 1) |
|
exitcode : 1 (pid: 19204) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[2]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-166-125.ec2.internal |
|
rank : 34 (local_rank: 2) |
|
exitcode : 1 (pid: 19205) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[3]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-166-125.ec2.internal |
|
rank : 35 (local_rank: 3) |
|
exitcode : 1 (pid: 19206) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[4]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-166-125.ec2.internal |
|
rank : 36 (local_rank: 4) |
|
exitcode : 1 (pid: 19207) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[5]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-166-125.ec2.internal |
|
rank : 37 (local_rank: 5) |
|
exitcode : 1 (pid: 19208) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[6]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-166-125.ec2.internal |
|
rank : 38 (local_rank: 6) |
|
exitcode : 1 (pid: 19209) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[7]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-166-125.ec2.internal |
|
rank : 39 (local_rank: 7) |
|
exitcode : 1 (pid: 19210) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
------------------------------------------------------------ |
|
Root Cause (first observed failure): |
|
[0]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-166-125.ec2.internal |
|
rank : 32 (local_rank: 0) |
|
exitcode : 1 (pid: 19203) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
============================================================ |
|
return launch_agent(self._config, self._entrypoint, list(args)) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent |
|
Traceback (most recent call last): |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module> |
|
raise ChildFailedError( |
|
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
============================================================ |
|
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED |
|
------------------------------------------------------------ |
|
Failures: |
|
[1]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-138.ec2.internal |
|
rank : 9 (local_rank: 1) |
|
exitcode : 1 (pid: 680885) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[2]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-138.ec2.internal |
|
rank : 10 (local_rank: 2) |
|
exitcode : 1 (pid: 680886) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[3]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-138.ec2.internal |
|
rank : 11 (local_rank: 3) |
|
exitcode : 1 (pid: 680887) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[4]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-138.ec2.internal |
|
rank : 12 (local_rank: 4) |
|
exitcode : 1 (pid: 680888) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[5]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-138.ec2.internal |
|
rank : 13 (local_rank: 5) |
|
exitcode : 1 (pid: 680889) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[6]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-138.ec2.internal |
|
rank : 14 (local_rank: 6) |
|
exitcode : 1 (pid: 680890) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[7]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-138.ec2.internal |
|
rank : 15 (local_rank: 7) |
|
exitcode : 1 (pid: 680891) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
------------------------------------------------------------ |
|
Root Cause (first observed failure): |
|
[0]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-138.ec2.internal |
|
rank : 8 (local_rank: 0) |
|
exitcode : 1 (pid: 680884) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
============================================================ |
|
sys.exit(main()) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper |
|
return f(*args, **kwargs) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main |
|
run(args) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run |
|
elastic_launch( |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
|
return launch_agent(self._config, self._entrypoint, list(args)) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent |
|
Traceback (most recent call last): |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module> |
|
raise ChildFailedError( |
|
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
============================================================ |
|
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED |
|
------------------------------------------------------------ |
|
Failures: |
|
[1]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-88.ec2.internal |
|
rank : 57 (local_rank: 1) |
|
exitcode : 1 (pid: 898301) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[2]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-88.ec2.internal |
|
rank : 58 (local_rank: 2) |
|
exitcode : 1 (pid: 898302) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[3]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-88.ec2.internal |
|
rank : 59 (local_rank: 3) |
|
exitcode : 1 (pid: 898303) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[4]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-88.ec2.internal |
|
rank : 60 (local_rank: 4) |
|
exitcode : 1 (pid: 898304) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[5]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-88.ec2.internal |
|
rank : 61 (local_rank: 5) |
|
exitcode : 1 (pid: 898305) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[6]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-88.ec2.internal |
|
rank : 62 (local_rank: 6) |
|
exitcode : 1 (pid: 898306) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[7]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-88.ec2.internal |
|
rank : 63 (local_rank: 7) |
|
exitcode : 1 (pid: 898307) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
------------------------------------------------------------ |
|
Root Cause (first observed failure): |
|
[0]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-88.ec2.internal |
|
rank : 56 (local_rank: 0) |
|
exitcode : 1 (pid: 898300) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
============================================================ |
|
sys.exit(main()) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper |
|
return f(*args, **kwargs) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main |
|
run(args) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run |
|
elastic_launch( |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
|
return launch_agent(self._config, self._entrypoint, list(args)) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent |
|
raise ChildFailedError( |
|
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
============================================================ |
|
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED |
|
------------------------------------------------------------ |
|
Failures: |
|
[1]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-62.ec2.internal |
|
rank : 49 (local_rank: 1) |
|
exitcode : 1 (pid: 3909935) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[2]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-62.ec2.internal |
|
rank : 50 (local_rank: 2) |
|
exitcode : 1 (pid: 3909936) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[3]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-62.ec2.internal |
|
rank : 51 (local_rank: 3) |
|
exitcode : 1 (pid: 3909937) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[4]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-62.ec2.internal |
|
rank : 52 (local_rank: 4) |
|
exitcode : 1 (pid: 3909938) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[5]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-62.ec2.internal |
|
rank : 53 (local_rank: 5) |
|
exitcode : 1 (pid: 3909939) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[6]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-62.ec2.internal |
|
rank : 54 (local_rank: 6) |
|
exitcode : 1 (pid: 3909940) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[7]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-62.ec2.internal |
|
rank : 55 (local_rank: 7) |
|
exitcode : 1 (pid: 3909941) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
------------------------------------------------------------ |
|
Root Cause (first observed failure): |
|
[0]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-171-62.ec2.internal |
|
rank : 48 (local_rank: 0) |
|
exitcode : 1 (pid: 3909934) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
============================================================ |
|
Traceback (most recent call last): |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module> |
|
sys.exit(main()) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper |
|
return f(*args, **kwargs) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main |
|
run(args) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run |
|
elastic_launch( |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
|
return launch_agent(self._config, self._entrypoint, list(args)) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent |
|
raise ChildFailedError( |
|
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
============================================================ |
|
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED |
|
------------------------------------------------------------ |
|
Failures: |
|
[1]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-78.ec2.internal |
|
rank : 25 (local_rank: 1) |
|
exitcode : 1 (pid: 1160131) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[2]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-78.ec2.internal |
|
rank : 26 (local_rank: 2) |
|
exitcode : 1 (pid: 1160132) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[3]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-78.ec2.internal |
|
rank : 27 (local_rank: 3) |
|
exitcode : 1 (pid: 1160133) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[4]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-78.ec2.internal |
|
rank : 28 (local_rank: 4) |
|
exitcode : 1 (pid: 1160134) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[5]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-78.ec2.internal |
|
rank : 29 (local_rank: 5) |
|
exitcode : 1 (pid: 1160135) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[6]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-78.ec2.internal |
|
rank : 30 (local_rank: 6) |
|
exitcode : 1 (pid: 1160136) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
[7]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-78.ec2.internal |
|
rank : 31 (local_rank: 7) |
|
exitcode : 1 (pid: 1160137) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
------------------------------------------------------------ |
|
Root Cause (first observed failure): |
|
[0]: |
|
time : 2024-07-03_03:05:58 |
|
host : ip-26-0-161-78.ec2.internal |
|
rank : 24 (local_rank: 0) |
|
exitcode : 1 (pid: 1160130) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
============================================================ |
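Editor's note: every per-rank entry above reports "error_file: <N/A>", so the underlying Python exception was never captured. Below is a minimal sketch of the mechanism the "To enable traceback" links point to, assuming the training script exposes a plain main() entrypoint (the real structure of run_train.py is not shown in this log):

    # sketch only, not the project's actual run_train.py
    from torch.distributed.elastic.multiprocessing.errors import record

    @record  # records the failing rank's exception and traceback so torchrun can surface it
    def main():
        ...  # training entrypoint would go here

    if __name__ == "__main__":
        main()

With the decorator in place, the summary above would show a real error_file and traceback for the first failing rank instead of "<N/A>".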
|
srun: error: ip-26-0-161-103: task 1: Exited with exit code 1 |
|
srun: error: ip-26-0-171-88: task 6: Exited with exit code 1 |
|
srun: error: ip-26-0-161-78: task 0: Exited with exit code 1 |
|
srun: error: ip-26-0-166-125: task 4: Exited with exit code 1 |
|
srun: error: ip-26-0-161-138: task 2: Exited with exit code 1 |
|
srun: error: ip-26-0-171-62: task 5: Exited with exit code 1 |
|
srun: error: ip-26-0-171-102: task 7: Exited with exit code 1 |
|
srun: error: ip-26-0-161-153: task 3: Exited with exit code 1 |
|
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details. |
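Editor's note: a minimal sketch of acting on the hint above, assuming uploads go through huggingface_hub; the destination repo id below is purely illustrative:

    # pip install hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # set before importing huggingface_hub

    from huggingface_hub import HfApi

    api = HfApi()
    # hypothetical repo, for illustration only
    api.upload_folder(folder_path="logs/", repo_id="some-user/bench_cluster_logs")

As the log message notes, hf_transfer speeds up large transfers but comes with the limitations described in the linked documentation.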
|
|