/home/AdamG012/.conda/envs/py39/lib/python3.9/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.4) doesn't match a supported version!
  warnings.warn(
[2023-04-14 07:44:46,752] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-14 07:44:48,341] [INFO] [runner.py:540:main] cmd = /home/AdamG012/.conda/envs/py39/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets openai/webgpt_comparisons stanfordnlp/SHP --data_split 2,4,4 --model_name_or_path facebook/opt-350m --num_padding_at_beginning 1 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 5e-5 --weight_decay 0.1 --num_train_epochs 1 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 0 --deepspeed --output_dir /chatgpt/hf_runs/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m
[2023-04-14 07:45:54,441] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-04-14 07:45:54,643] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-04-14 07:45:54,643] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-04-14 07:45:54,643] [INFO] [launch.py:247:main] dist_world_size=8
[2023-04-14 07:45:54,643] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
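For reference, the --world_info value in the launcher command above is just a base64-encoded JSON map of hosts to GPU ranks, which launch.py echoes back as WORLD INFO DICT. A minimal sketch of decoding it, standard library only:

import base64
import json

world_info = "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119"
print(json.loads(base64.urlsafe_b64decode(world_info)))
# {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}  -> one node, eight local ranks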
[2023-04-14 07:49:22,604] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Downloading (…)okenizer_config.json: 0%|          | 0.00/685 [00:00<?, ?B/s]
[2023-04-14 08:09:50,233] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05, 5e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
Using /home/AdamG012/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
[2023-04-14 08:09:50,234] [INFO] [config.py:953:print] DeepSpeedEngine configuration:
[2023-04-14 08:09:50,235] [INFO] [config.py:957:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2023-04-14 08:09:50,235] [INFO] [config.py:957:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-04-14 08:09:50,235] [INFO] [config.py:957:print] amp_enabled .................. False
[2023-04-14 08:09:50,235] [INFO] [config.py:957:print] amp_params ................... False
[2023-04-14 08:09:50,235] [INFO] [config.py:957:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2023-04-14 08:09:50,235] [INFO] [config.py:957:print] bfloat16_enabled ............. False
[2023-04-14 08:09:50,235] [INFO] [config.py:957:print] checkpoint_parallel_write_pipeline False
[2023-04-14 08:09:50,235] [INFO] [config.py:957:print] checkpoint_tag_validation_enabled True
[2023-04-14 08:09:50,235] [INFO] [config.py:957:print] checkpoint_tag_validation_fail False
[2023-04-14 08:09:50,235] [INFO] [config.py:957:print] comms_config .................
[2023-04-14 08:09:50,235] [INFO] [config.py:957:print] communication_data_type ...... None
[2023-04-14 08:09:50,235] [INFO] [config.py:957:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-04-14 08:09:50,235] [INFO] [config.py:957:print] curriculum_enabled_legacy .... False
[2023-04-14 08:09:50,235] [INFO] [config.py:957:print] curriculum_params_legacy ..... False
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] data_efficiency_enabled ...... False
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] dataloader_drop_last ......... False
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] disable_allgather ............ False
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] dump_state ................... False
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'min_scale': 1}
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] eigenvalue_enabled ........... False
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] eigenvalue_gas_boundary_resolution 1
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] eigenvalue_layer_num ......... 0
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] eigenvalue_max_iter .......... 100
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] eigenvalue_stability ......... 1e-06
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] eigenvalue_tol ............... 0.01
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] eigenvalue_verbose ........... False
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] elasticity_enabled ........... False
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] flops_profiler_config ........ { "enabled": false, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] fp16_auto_cast ............... False
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] fp16_enabled ................. True
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] fp16_master_weights_and_gradients False
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] global_rank .................. 0
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] grad_accum_dtype ............. None
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] gradient_accumulation_steps .. 1
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] gradient_clipping ............ 1.0
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] gradient_predivide_factor .... 1.0
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] initial_dynamic_scale ........ 65536
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] load_universal_checkpoint .... False
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] loss_scale ................... 0
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] memory_breakdown ............. False
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] optimizer_legacy_fusion ...... False
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] optimizer_name ............... None
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] optimizer_params ............. None
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] pld_enabled .................. False
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] pld_params ................... False
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] prescale_gradients ........... False
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] scheduler_name ............... None
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] scheduler_params ............. None
[2023-04-14 08:09:50,236] [INFO] [config.py:957:print] sparse_attention ............. None
[2023-04-14 08:09:50,237] [INFO] [config.py:957:print] sparse_gradients_enabled ..... False
[2023-04-14 08:09:50,237] [INFO] [config.py:957:print] steps_per_print .............. 10
[2023-04-14 08:09:50,237] [INFO] [config.py:957:print] train_batch_size ............. 64
[2023-04-14 08:09:50,237] [INFO] [config.py:957:print] train_micro_batch_size_per_gpu 8
[2023-04-14 08:09:50,237] [INFO] [config.py:957:print] use_node_local_storage ....... False
[2023-04-14 08:09:50,237] [INFO] [config.py:957:print] wall_clock_breakdown ......... False
[2023-04-14 08:09:50,237] [INFO] [config.py:957:print] world_size ................... 8
[2023-04-14 08:09:50,237] [INFO] [config.py:957:print] zero_allow_untested_optimizer False
[2023-04-14 08:09:50,237] [INFO] [config.py:957:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False memory_efficient_linear=False
[2023-04-14 08:09:50,237] [INFO] [config.py:957:print] zero_enabled ................. False
[2023-04-14 08:09:50,237] [INFO] [config.py:957:print] zero_force_ds_cpu_optimizer .. True
[2023-04-14 08:09:50,237] [INFO] [config.py:957:print] zero_optimization_stage ...... 0
[2023-04-14 08:09:50,237] [INFO] [config.py:943:print_user_config] json = {
    "train_batch_size": 64, 
    "train_micro_batch_size_per_gpu": 8, 
    "steps_per_print": 10, 
    "zero_optimization": {
        "stage": 0, 
        "offload_param": {
            "device": "none"
        }, 
        "offload_optimizer": {
            "device": "none"
        }, 
        "stage3_param_persistence_threshold": 1.000000e+04, 
        "stage3_max_live_parameters": 3.000000e+07, 
        "stage3_prefetch_bucket_size": 3.000000e+07, 
        "memory_efficient_linear": false
    }, 
    "fp16": {
        "enabled": true, 
        "loss_scale_window": 100
    }, 
    "gradient_clipping": 1.0, 
    "prescale_gradients": false, 
    "wall_clock_breakdown": false, 
    "hybrid_engine": {
        "enabled": false, 
        "inference_tp_size": 1, 
        "release_inference_cache": false, 
        "pin_parameters": true, 
        "tp_gather_partition_size": 8
    }
}
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Emitting ninja build file /home/AdamG012/.cache/torch_extensions/py39_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
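The user config JSON printed above is what the step-2 script hands to deepspeed.initialize; note that train_batch_size 64 = 8 (train_micro_batch_size_per_gpu) x 1 (gradient_accumulation_steps) x 8 (ranks), a consistency DeepSpeed checks at startup. A minimal sketch of wiring such a config up, run under the deepspeed launcher; the toy model and optimizer here are illustrative stand-ins, not the script's actual OPT-350m reward model:

import os
import torch
import deepspeed

# the tokenizers fork warning above suggests setting this before any forking
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

ds_config = {
    "train_batch_size": 64,               # 8 micro * 1 accum * 8 ranks
    "train_micro_batch_size_per_gpu": 8,
    "steps_per_print": 10,
    "zero_optimization": {"stage": 0},
    "fp16": {"enabled": True, "loss_scale_window": 100},
    "gradient_clipping": 1.0,
}

model = torch.nn.Linear(512, 1)           # stand-in for the reward model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)

# returns the wrapped engine; optimizer_name is None in the log because a
# client-side optimizer is passed in rather than configured in the JSON
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config)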
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 9.61396598815918 seconds
Loading extension module utils...
Time to load utils op: 9.616131782531738 seconds
Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 9.616359233856201 seconds
Loading extension module utils...
Time to load utils op: 9.614547967910767 seconds
Time to load utils op: 9.615831136703491 seconds
Time to load utils op: 9.614869594573975 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 9.616759538650513 seconds
Time to load utils op: 9.61654543876648 seconds
***** Running training *****
***** Evaluating reward, Epoch 0/1 *****
chosen_last_scores (higher is better) : 2.70546555519104, acc (higher is better) : 0.4709596335887909
Beginning of Epoch 1/1, Total Micro Batches 4130
[2023-04-14 08:10:09,293] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 0
[2023-04-14 08:10:09,293] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 65536 to 32768.0
[2023-04-14 08:10:09,293] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 65536, reducing to 32768.0
[2023-04-14 08:10:09,527] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1
[2023-04-14 08:10:09,527] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-04-14 08:10:09,527] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
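For context on the "Evaluating reward" line above: in the step-2 reward-model eval, acc is the fraction of preference pairs where the chosen response's reward beats the rejected one's, and chosen_last_scores averages the chosen-side end-of-sequence reward; ~0.47 before training is roughly chance. An illustrative sketch of that metric, with toy tensors rather than the script's actual variables:

import torch

chosen = torch.tensor([2.1, 3.0, 0.5, 1.2])    # rewards for preferred answers
rejected = torch.tensor([2.5, 1.0, 0.9, 1.1])  # rewards for paired rejected answers

acc = (chosen > rejected).float().mean().item()  # fraction ranked correctly
print(f"chosen_last_scores : {chosen.mean().item():.4f}, acc : {acc:.4f}")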
[2023-04-14 08:10:09,761] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2
[2023-04-14 08:10:10,162] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-04-14 08:10:10,163] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
Attempted loss scale: 16384.0, reducing to 8192.0 [2023-04-14 08:10:10,398] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3 [2023-04-14 08:10:10,398] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0 [2023-04-14 08:10:10,398] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3 [2023-04-14 08:10:10,398] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0 [2023-04-14 08:10:10,398] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0 [2023-04-14 08:10:10,398] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3 [2023-04-14 08:10:10,398] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3 [2023-04-14 08:10:10,398] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3 [2023-04-14 08:10:10,398] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0 [2023-04-14 08:10:10,398] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0 [2023-04-14 08:10:10,398] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0 [2023-04-14 08:10:10,398] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3 [2023-04-14 08:10:10,398] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0 [2023-04-14 08:10:10,398] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3 [2023-04-14 08:10:10,399] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0 [2023-04-14 08:10:10,401] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3 [2023-04-14 08:10:10,401] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0 [2023-04-14 08:10:10,636] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4 [2023-04-14 08:10:10,636] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4 [2023-04-14 08:10:10,636] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 4096.0 to 2048.0 [2023-04-14 08:10:10,636] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 4096.0 to 2048.0 [2023-04-14 08:10:10,636] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4 [2023-04-14 08:10:10,636] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4 [2023-04-14 08:10:10,636] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4 [2023-04-14 08:10:10,636] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4 [2023-04-14 08:10:10,636] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4 [2023-04-14 08:10:10,636] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 4 [2023-04-14 08:10:10,636] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 4096.0 to 2048.0 [2023-04-14 08:10:10,636] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. 
[2023-04-14 08:10:10,869] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 5
[2023-04-14 08:10:10,869] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 2048.0 to 1024.0
[2023-04-14 08:10:10,869] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 2048.0, reducing to 1024.0
[2023-04-14 08:10:11,102] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 6
[2023-04-14 08:10:11,102] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1024.0 to 512.0
[2023-04-14 08:10:11,103] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1024.0, reducing to 512.0
[2023-04-14 08:10:11,337] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 7
[2023-04-14 08:10:11,337] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 512.0 to 256.0
[2023-04-14 08:10:11,337] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 512.0, reducing to 256.0
[2023-04-14 08:10:11,840] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 9
[2023-04-14 08:10:11,840] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 256.0 to 128.0
[2023-04-14 08:10:11,840] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 256.0, reducing to 128.0
[2023-04-14 08:10:11,841] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=9, lr=[4.9999992767147075e-05, 4.9999992767147075e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:11,841] [INFO] [timer.py:199:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=221.7406685188736, CurrSamplesPerSec=274.42309347656516, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:10:12,073] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 10
[2023-04-14 08:10:12,073] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 128.0 to 64.0
[2023-04-14 08:10:12,073] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 128.0, reducing to 64.0
[2023-04-14 08:10:14,494] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=10, lr=[4.999927671816018e-05, 4.999927671816018e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:14,503] [INFO] [timer.py:199:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=232.0205967431515, CurrSamplesPerSec=235.61416693218035, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:10:17,202] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=10, lr=[4.9997106914491646e-05, 4.9997106914491646e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:17,210] [INFO] [timer.py:199:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=233.72149514360893, CurrSamplesPerSec=237.47406265868585, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:10:19,919] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=10, lr=[4.9993490714544764e-05, 4.9993490714544764e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:19,928] [INFO] [timer.py:199:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=234.32636732206532, CurrSamplesPerSec=233.77768101225254, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:10:22,632] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=10, lr=[4.998842832756209e-05, 4.998842832756209e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:22,641] [INFO] [timer.py:199:stop] epoch=0/micro_step=50/global_step=50, RunningAvgSamplesPerSec=234.7425540133339, CurrSamplesPerSec=233.92232260110967, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:10:25,343] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=10, lr=[4.99819200464662e-05, 4.99819200464662e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:25,896] [INFO] [timer.py:199:stop] epoch=0/micro_step=60/global_step=60, RunningAvgSamplesPerSec=227.22204987515545, CurrSamplesPerSec=78.53451712478756, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:10:28,594] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=10, lr=[4.997396624784284e-05, 4.997396624784284e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:28,602] [INFO] [timer.py:199:stop] epoch=0/micro_step=70/global_step=70, RunningAvgSamplesPerSec=228.60499860102635, CurrSamplesPerSec=237.33737684886947, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:10:31,295] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=10, lr=[4.996456739191905e-05, 4.996456739191905e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:31,304] [INFO] [timer.py:199:stop] epoch=0/micro_step=80/global_step=80, RunningAvgSamplesPerSec=229.69156816751345, CurrSamplesPerSec=238.59911399335851, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:10:34,006] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=10, lr=[4.9953724022536573e-05, 4.9953724022536573e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:34,015] [INFO] [timer.py:199:stop] epoch=0/micro_step=90/global_step=90, RunningAvgSamplesPerSec=230.453017235139, CurrSamplesPerSec=236.88310074673623, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
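The lr values in the step lines above are consistent with a plain cosine decay over the 4130 total micro-steps with --num_warmup_steps 0, where only non-skipped steps advance the scheduler (at step=20 with skipped=10, the scheduler has ticked 10 times). A quick check, assuming the standard cosine schedule from transformers.get_scheduler:

import math

lr_max, total_steps = 5e-5, 4130   # --learning_rate and Total Micro Batches above

def cosine_lr(scheduler_step):
    # half-cosine decay from lr_max toward 0, with zero warmup
    return lr_max * 0.5 * (1 + math.cos(math.pi * scheduler_step / total_steps))

print(cosine_lr(10))  # ~4.999927672e-05, matching the step=20, skipped=10 line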
[2023-04-14 08:10:36,739] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=10, lr=[4.9941436767120384e-05, 4.9941436767120384e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:37,443] [INFO] [timer.py:199:stop] epoch=0/micro_step=100/global_step=100, RunningAvgSamplesPerSec=225.11673053111124, CurrSamplesPerSec=64.79504899900769, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:10:40,140] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=10, lr=[4.9927706336642385e-05, 4.9927706336642385e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:40,149] [INFO] [timer.py:199:stop] epoch=0/micro_step=110/global_step=110, RunningAvgSamplesPerSec=226.1707287783656, CurrSamplesPerSec=237.86862418620066, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:10:40,672] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:10:40,672] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 64.0 to 128.0
[2023-04-14 08:10:42,845] [INFO] [logging.py:96:log_dist] [Rank 0] step=120, skipped=10, lr=[4.991253352558025e-05, 4.991253352558025e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:42,853] [INFO] [timer.py:199:stop] epoch=0/micro_step=120/global_step=120, RunningAvgSamplesPerSec=227.0564841374241, CurrSamplesPerSec=238.43041842386688, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:10:45,561] [INFO] [logging.py:96:log_dist] [Rank 0] step=130, skipped=10, lr=[4.9895919211871465e-05, 4.9895919211871465e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:45,569] [INFO] [timer.py:199:stop] epoch=0/micro_step=130/global_step=130, RunningAvgSamplesPerSec=227.73782883983932, CurrSamplesPerSec=236.96465420441945, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
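The pattern in these entries (halve the scale on every overflow, 65536 down to 64 during the first steps, then double after 100 clean iterations) is DeepSpeed's dynamic loss scaling with the dynamic_loss_scale_args printed earlier: init_scale 65536, scale_window 100, min_scale 1. A simplified sketch of that update rule, ignoring the delayed_shift hysteresis and not DeepSpeed's actual code:

class ToyLossScaler:
    def __init__(self, init_scale=65536, scale_window=100, min_scale=1):
        self.scale = init_scale
        self.scale_window = scale_window
        self.min_scale = min_scale
        self.clean_iters = 0          # iterations since the last overflow

    def update(self, overflow):
        if overflow:                  # skip the step and halve the scale
            self.scale = max(self.scale / 2, self.min_scale)
            self.clean_iters = 0
        else:
            self.clean_iters += 1
            if self.clean_iters % self.scale_window == 0:
                self.scale *= 2       # "No Grad overflow for 100 iterations"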
[2023-04-14 08:10:48,256] [INFO] [logging.py:96:log_dist] [Rank 0] step=140, skipped=10, lr=[4.9877864356862564e-05, 4.9877864356862564e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:48,469] [INFO] [timer.py:199:stop] epoch=0/micro_step=140/global_step=140, RunningAvgSamplesPerSec=227.2442619771578, CurrSamplesPerSec=134.4025987746145, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:10:51,158] [INFO] [logging.py:96:log_dist] [Rank 0] step=150, skipped=10, lr=[4.985837000525343e-05, 4.985837000525343e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:51,166] [INFO] [timer.py:199:stop] epoch=0/micro_step=150/global_step=150, RunningAvgSamplesPerSec=227.9244754412681, CurrSamplesPerSec=238.3006683788196, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:10:53,847] [INFO] [logging.py:96:log_dist] [Rank 0] step=160, skipped=10, lr=[4.9837437285036906e-05, 4.9837437285036906e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:53,856] [INFO] [timer.py:199:stop] epoch=0/micro_step=160/global_step=160, RunningAvgSamplesPerSec=228.56118763734557, CurrSamplesPerSec=239.10578409924963, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:10:56,545] [INFO] [logging.py:96:log_dist] [Rank 0] step=170, skipped=10, lr=[4.981506740743351e-05, 4.981506740743351e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:56,554] [INFO] [timer.py:199:stop] epoch=0/micro_step=170/global_step=170, RunningAvgSamplesPerSec=229.08598381870283, CurrSamplesPerSec=238.2570972864113, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:10:59,234] [INFO] [logging.py:96:log_dist] [Rank 0] step=180, skipped=10, lr=[4.979126166682133e-05, 4.979126166682133e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:10:59,243] [INFO] [timer.py:199:stop] epoch=0/micro_step=180/global_step=180, RunningAvgSamplesPerSec=229.59282015130782, CurrSamplesPerSec=238.63390002880297, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:01,626] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 188
[2023-04-14 08:11:01,626] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 128.0 to 64.0
[2023-04-14 08:11:01,627] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 128.0, reducing to 64.0
[2023-04-14 08:11:01,887] [INFO] [logging.py:96:log_dist] [Rank 0] step=190, skipped=11, lr=[4.976860997415379e-05, 4.976860997415379e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:01,896] [INFO] [timer.py:199:stop] epoch=0/micro_step=190/global_step=190, RunningAvgSamplesPerSec=230.20369631868707, CurrSamplesPerSec=238.69586125457056, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:04,587] [INFO] [logging.py:96:log_dist] [Rank 0] step=200, skipped=11, lr=[4.97420799573331e-05, 4.97420799573331e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:04,596] [INFO] [timer.py:199:stop] epoch=0/micro_step=200/global_step=200, RunningAvgSamplesPerSec=230.56446279331828, CurrSamplesPerSec=236.72621608202837, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:07,282] [INFO] [logging.py:96:log_dist] [Rank 0] step=210, skipped=11, lr=[4.971411830074341e-05, 4.971411830074341e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:07,291] [INFO] [timer.py:199:stop] epoch=0/micro_step=210/global_step=210, RunningAvgSamplesPerSec=230.9072210450696, CurrSamplesPerSec=238.27401620652896, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:09,980] [INFO] [logging.py:96:log_dist] [Rank 0] step=220, skipped=11, lr=[4.968472662231739e-05, 4.968472662231739e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:09,988] [INFO] [timer.py:199:stop] epoch=0/micro_step=220/global_step=220, RunningAvgSamplesPerSec=231.2127708150853, CurrSamplesPerSec=237.54761474936905, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:12,731] [INFO] [logging.py:96:log_dist] [Rank 0] step=230, skipped=11, lr=[4.965390662273243e-05, 4.965390662273243e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:12,740] [INFO] [timer.py:199:stop] epoch=0/micro_step=230/global_step=230, RunningAvgSamplesPerSec=231.29141684266767, CurrSamplesPerSec=238.99657489425087, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:15,424] [INFO] [logging.py:96:log_dist] [Rank 0] step=240, skipped=11, lr=[4.9621660085312186e-05, 4.9621660085312186e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:15,433] [INFO] [timer.py:199:stop] epoch=0/micro_step=240/global_step=240, RunningAvgSamplesPerSec=231.57088254715143, CurrSamplesPerSec=237.77612667435528, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:18,123] [INFO] [logging.py:96:log_dist] [Rank 0] step=250, skipped=11, lr=[4.958798887592347e-05, 4.958798887592347e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:18,132] [INFO] [timer.py:199:stop] epoch=0/micro_step=250/global_step=250, RunningAvgSamplesPerSec=231.8076417563576, CurrSamplesPerSec=236.27073895267728, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:20,815] [INFO] [logging.py:96:log_dist] [Rank 0] step=260, skipped=11, lr=[4.955289494286822e-05, 4.955289494286822e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:20,824] [INFO] [timer.py:199:stop] epoch=0/micro_step=260/global_step=260, RunningAvgSamplesPerSec=232.04845805267215, CurrSamplesPerSec=238.48528187564688, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:23,505] [INFO] [logging.py:96:log_dist] [Rank 0] step=270, skipped=11, lr=[4.9516380316770804e-05, 4.9516380316770804e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:23,556] [INFO] [timer.py:199:stop] epoch=0/micro_step=270/global_step=270, RunningAvgSamplesPerSec=232.14744581378477, CurrSamplesPerSec=206.39910807994985, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:26,239] [INFO] [logging.py:96:log_dist] [Rank 0] step=280, skipped=11, lr=[4.947844711046048e-05, 4.947844711046048e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:26,248] [INFO] [timer.py:199:stop] epoch=0/micro_step=280/global_step=280, RunningAvgSamplesPerSec=232.35933987591557, CurrSamplesPerSec=238.52278995706445, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:28,919] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:11:28,920] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 64.0 to 128.0
[2023-04-14 08:11:28,931] [INFO] [logging.py:96:log_dist] [Rank 0] step=290, skipped=11, lr=[4.943909751884919e-05, 4.943909751884919e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:28,940] [INFO] [timer.py:199:stop] epoch=0/micro_step=290/global_step=290, RunningAvgSamplesPerSec=232.55903828227068, CurrSamplesPerSec=238.25646287364668, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:31,334] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 298
[2023-04-14 08:11:31,334] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 128.0 to 64.0
[2023-04-14 08:11:31,334] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 128.0, reducing to 64.0
[2023-04-14 08:11:31,595] [INFO] [logging.py:96:log_dist] [Rank 0] step=300, skipped=12, lr=[4.940247375710648e-05, 4.940247375710648e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:31,604] [INFO] [timer.py:199:stop] epoch=0/micro_step=300/global_step=300, RunningAvgSamplesPerSec=232.8243938393462, CurrSamplesPerSec=238.84197938613974, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:34,287] [INFO] [logging.py:96:log_dist] [Rank 0] step=310, skipped=12, lr=[4.936043937382387e-05, 4.936043937382387e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:34,731] [INFO] [timer.py:199:stop] epoch=0/micro_step=310/global_step=310, RunningAvgSamplesPerSec=231.8036741532554, CurrSamplesPerSec=90.92242435450726, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:37,423] [INFO] [logging.py:96:log_dist] [Rank 0] step=320, skipped=12, lr=[4.931699543346854e-05, 4.931699543346854e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:37,432] [INFO] [timer.py:199:stop] epoch=0/micro_step=320/global_step=320, RunningAvgSamplesPerSec=231.9746651225798, CurrSamplesPerSec=239.0642600718345, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:40,137] [INFO] [logging.py:96:log_dist] [Rank 0] step=330, skipped=12, lr=[4.9272144449817517e-05, 4.9272144449817517e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:40,145] [INFO] [timer.py:199:stop] epoch=0/micro_step=330/global_step=330, RunningAvgSamplesPerSec=232.11083924972127, CurrSamplesPerSec=236.97574144453125, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:42,877] [INFO] [logging.py:96:log_dist] [Rank 0] step=340, skipped=12, lr=[4.922588901806297e-05, 4.922588901806297e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:42,887] [INFO] [timer.py:199:stop] epoch=0/micro_step=340/global_step=340, RunningAvgSamplesPerSec=232.1780222338736, CurrSamplesPerSec=215.14509235029394, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:45,607] [INFO] [logging.py:96:log_dist] [Rank 0] step=350, skipped=12, lr=[4.9178231814662e-05, 4.9178231814662e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:45,991] [INFO] [timer.py:199:stop] epoch=0/micro_step=350/global_step=350, RunningAvgSamplesPerSec=231.35294428562366, CurrSamplesPerSec=99.41369076056242, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:48,723] [INFO] [logging.py:96:log_dist] [Rank 0] step=360, skipped=12, lr=[4.9129175597181784e-05, 4.9129175597181784e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:48,732] [INFO] [timer.py:199:stop] epoch=0/micro_step=360/global_step=360, RunningAvgSamplesPerSec=231.4312551657723, CurrSamplesPerSec=237.5560236037905, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:51,419] [INFO] [logging.py:96:log_dist] [Rank 0] step=370, skipped=12, lr=[4.9078723204140034e-05, 4.9078723204140034e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:51,428] [INFO] [timer.py:199:stop] epoch=0/micro_step=370/global_step=370, RunningAvgSamplesPerSec=231.60336623807208, CurrSamplesPerSec=238.18458464285018, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:54,122] [INFO] [logging.py:96:log_dist] [Rank 0] step=380, skipped=12, lr=[4.902687755484071e-05, 4.902687755484071e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:54,131] [INFO] [timer.py:199:stop] epoch=0/micro_step=380/global_step=380, RunningAvgSamplesPerSec=231.74847821066203, CurrSamplesPerSec=236.23331206575642, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:56,823] [INFO] [logging.py:96:log_dist] [Rank 0] step=390, skipped=12, lr=[4.897364164920514e-05, 4.897364164920514e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:56,832] [INFO] [timer.py:199:stop] epoch=0/micro_step=390/global_step=390, RunningAvgSamplesPerSec=231.8917896408389, CurrSamplesPerSec=236.52847439018694, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:11:59,510] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:11:59,510] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 64.0 to 128.0
[2023-04-14 08:11:59,521] [INFO] [logging.py:96:log_dist] [Rank 0] step=400, skipped=12, lr=[4.891901856759844e-05, 4.891901856759844e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:11:59,530] [INFO] [timer.py:199:stop] epoch=0/micro_step=400/global_step=400, RunningAvgSamplesPerSec=232.0346566162929, CurrSamplesPerSec=238.3707116255197, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:02,217] [INFO] [logging.py:96:log_dist] [Rank 0] step=410, skipped=12, lr=[4.8863011470651234e-05, 4.8863011470651234e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:02,230] [INFO] [timer.py:199:stop] epoch=0/micro_step=410/global_step=410, RunningAvgSamplesPerSec=232.17445182165847, CurrSamplesPerSec=238.28818765112968, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:04,087] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 416
[2023-04-14 08:12:04,087] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 128.0 to 64.0
[2023-04-14 08:12:04,088] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 128.0, reducing to 64.0
[2023-04-14 08:12:04,320] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 417
[2023-04-14 08:12:04,320] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 64.0 to 32.0
[2023-04-14 08:12:04,321] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 64.0, reducing to 32.0
[2023-04-14 08:12:04,851] [INFO] [logging.py:96:log_dist] [Rank 0] step=420, skipped=14, lr=[4.881721147712162e-05, 4.881721147712162e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:04,860] [INFO] [timer.py:199:stop] epoch=0/micro_step=420/global_step=420, RunningAvgSamplesPerSec=232.4416087472903, CurrSamplesPerSec=237.9325084204928, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:07,539] [INFO] [logging.py:96:log_dist] [Rank 0] step=430, skipped=14, lr=[4.875872137285494e-05, 4.875872137285494e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:07,599] [INFO] [timer.py:199:stop] epoch=0/micro_step=430/global_step=430, RunningAvgSamplesPerSec=232.4804815385331, CurrSamplesPerSec=200.80510115230746, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:10,275] [INFO] [logging.py:96:log_dist] [Rank 0] step=440, skipped=14, lr=[4.869885652845176e-05, 4.869885652845176e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:10,284] [INFO] [timer.py:199:stop] epoch=0/micro_step=440/global_step=440, RunningAvgSamplesPerSec=232.62303695754147, CurrSamplesPerSec=239.30850048096082, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:12,963] [INFO] [logging.py:96:log_dist] [Rank 0] step=450, skipped=14, lr=[4.863762040784446e-05, 4.863762040784446e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:12,972] [INFO] [timer.py:199:stop] epoch=0/micro_step=450/global_step=450, RunningAvgSamplesPerSec=232.75301771056084, CurrSamplesPerSec=238.36541991260498, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
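The skipped= counter in the log_dist lines counts exactly these abandoned updates: iterations 416 and 417 overflow back to back, so skipped jumps from 12 (step=410) to 14 (step=420). Overflow detection amounts to scanning the gradients for inf/NaN before applying the optimizer step; a sketch in plain PyTorch (illustrative, not DeepSpeed's code):

    import torch

    def has_overflow(params):
        # True if any gradient contains inf/NaN under the current loss scale;
        # mirrors the check behind "Overflow detected. Skipping step."
        for p in params:
            if p.grad is not None:
                g = p.grad.float()
                if torch.isinf(g).any() or torch.isnan(g).any():
                    return True
        return False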
[2023-04-14 08:12:15,727] [INFO] [logging.py:96:log_dist] [Rank 0] step=460, skipped=14, lr=[4.857501655431095e-05, 4.857501655431095e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:15,736] [INFO] [timer.py:199:stop] epoch=0/micro_step=460/global_step=460, RunningAvgSamplesPerSec=232.74334699394194, CurrSamplesPerSec=239.21253507734616, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:18,422] [INFO] [logging.py:96:log_dist] [Rank 0] step=470, skipped=14, lr=[4.8511048590269656e-05, 4.8511048590269656e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:18,431] [INFO] [timer.py:199:stop] epoch=0/micro_step=470/global_step=470, RunningAvgSamplesPerSec=232.85350583338024, CurrSamplesPerSec=238.53296367384658, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:21,117] [INFO] [logging.py:96:log_dist] [Rank 0] step=480, skipped=14, lr=[4.844572021706993e-05, 4.844572021706993e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:21,126] [INFO] [timer.py:199:stop] epoch=0/micro_step=480/global_step=480, RunningAvgSamplesPerSec=232.9592963431827, CurrSamplesPerSec=239.85676259372076, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:23,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=490, skipped=14, lr=[4.8379035214777836e-05, 4.8379035214777836e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:23,814] [INFO] [timer.py:199:stop] epoch=0/micro_step=490/global_step=490, RunningAvgSamplesPerSec=233.0720234305768, CurrSamplesPerSec=238.9667885378749, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:26,495] [INFO] [logging.py:96:log_dist] [Rank 0] step=500, skipped=14, lr=[4.8310997441957476e-05, 4.8310997441957476e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:26,504] [INFO] [timer.py:199:stop] epoch=0/micro_step=500/global_step=500, RunningAvgSamplesPerSec=233.17721610864677, CurrSamplesPerSec=238.53190387118846, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:29,185] [INFO] [logging.py:96:log_dist] [Rank 0] step=510, skipped=14, lr=[4.824161083544769e-05, 4.824161083544769e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:29,194] [INFO] [timer.py:199:stop] epoch=0/micro_step=510/global_step=510, RunningAvgSamplesPerSec=233.27795973054046, CurrSamplesPerSec=238.60293147738284, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:31,595] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:12:31,595] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 32.0 to 64.0
[2023-04-14 08:12:32,813] [INFO] [logging.py:96:log_dist] [Rank 0] step=520, skipped=14, lr=[4.817087941013426e-05, 4.817087941013426e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:32,822] [INFO] [timer.py:199:stop] epoch=0/micro_step=520/global_step=520, RunningAvgSamplesPerSec=231.8445427087475, CurrSamplesPerSec=239.08555264308725, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:35,508] [INFO] [logging.py:96:log_dist] [Rank 0] step=530, skipped=14, lr=[4.809880725871763e-05, 4.809880725871763e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:35,516] [INFO] [timer.py:199:stop] epoch=0/micro_step=530/global_step=530, RunningAvgSamplesPerSec=231.9575012849858, CurrSamplesPerSec=239.2457943667018, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:38,191] [INFO] [logging.py:96:log_dist] [Rank 0] step=540, skipped=14, lr=[4.802539855147605e-05, 4.802539855147605e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:38,200] [INFO] [timer.py:199:stop] epoch=0/micro_step=540/global_step=540, RunningAvgSamplesPerSec=232.08397628128074, CurrSamplesPerSec=239.32578243131437, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:40,880] [INFO] [logging.py:96:log_dist] [Rank 0] step=550, skipped=14, lr=[4.795065753602433e-05, 4.795065753602433e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:40,889] [INFO] [timer.py:199:stop] epoch=0/micro_step=550/global_step=550, RunningAvgSamplesPerSec=232.1977746929332, CurrSamplesPerSec=237.65908040563153, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:43,568] [INFO] [logging.py:96:log_dist] [Rank 0] step=560, skipped=14, lr=[4.787458853706798e-05, 4.787458853706798e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:43,613] [INFO] [timer.py:199:stop] epoch=0/micro_step=560/global_step=560, RunningAvgSamplesPerSec=232.25510748473874, CurrSamplesPerSec=209.75797096126632, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:46,293] [INFO] [logging.py:96:log_dist] [Rank 0] step=570, skipped=14, lr=[4.7797195956153054e-05, 4.7797195956153054e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:46,301] [INFO] [timer.py:199:stop] epoch=0/micro_step=570/global_step=570, RunningAvgSamplesPerSec=232.36310451882744, CurrSamplesPerSec=238.5791802279179, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:49,033] [INFO] [logging.py:96:log_dist] [Rank 0] step=580, skipped=14, lr=[4.7718484271411417e-05, 4.7718484271411417e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:49,041] [INFO] [timer.py:199:stop] epoch=0/micro_step=580/global_step=580, RunningAvgSamplesPerSec=232.39206681870155, CurrSamplesPerSec=210.2167640605006, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:51,724] [INFO] [logging.py:96:log_dist] [Rank 0] step=590, skipped=14, lr=[4.763845803730164e-05, 4.763845803730164e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:51,733] [INFO] [timer.py:199:stop] epoch=0/micro_step=590/global_step=590, RunningAvgSamplesPerSec=232.48936747452797, CurrSamplesPerSec=238.72409398653042, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:54,433] [INFO] [logging.py:96:log_dist] [Rank 0] step=600, skipped=14, lr=[4.755712188434546e-05, 4.755712188434546e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:54,896] [INFO] [timer.py:199:stop] epoch=0/micro_step=600/global_step=600, RunningAvgSamplesPerSec=231.92845033058177, CurrSamplesPerSec=88.64557154311423, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:57,582] [INFO] [logging.py:96:log_dist] [Rank 0] step=610, skipped=14, lr=[4.747448051885988e-05, 4.747448051885988e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:12:57,591] [INFO] [timer.py:199:stop] epoch=0/micro_step=610/global_step=610, RunningAvgSamplesPerSec=232.02619610343774, CurrSamplesPerSec=237.135470922056, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:12:59,987] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:12:59,987] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 64.0 to 128.0
[2023-04-14 08:13:00,267] [INFO] [logging.py:96:log_dist] [Rank 0] step=620, skipped=14, lr=[4.73905387226848e-05, 4.73905387226848e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:00,275] [INFO] [timer.py:199:stop] epoch=0/micro_step=620/global_step=620, RunningAvgSamplesPerSec=232.13346762707084, CurrSamplesPerSec=239.0338182834937, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:02,956] [INFO] [logging.py:96:log_dist] [Rank 0] step=630, skipped=14, lr=[4.7305301352906376e-05, 4.7305301352906376e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:02,965] [INFO] [timer.py:199:stop] epoch=0/micro_step=630/global_step=630, RunningAvgSamplesPerSec=232.23126005321407, CurrSamplesPerSec=239.19420378187016, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
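The lr column decays smoothly from just under 5e-05. Inverting a zero-warmup cosine schedule at several of the logged points (e.g. step=300 with skipped=12, step=1000 with skipped=14) gives a consistent horizon of roughly 4130 scheduler steps, with the scheduler apparently advancing only on steps that were not skipped. A sketch that reproduces the column under those assumptions; total_steps is inferred from this excerpt, not read from the run configuration:

    import math

    def cosine_lr(scheduler_step, base_lr=5e-05, total_steps=4130):
        # Zero-warmup cosine decay. total_steps ~ 4130 is inferred by
        # inverting this formula at several logged (step, lr) points;
        # the scheduler step that fits the log is (step - skipped),
        # i.e. only applied updates advance the schedule.
        progress = scheduler_step / total_steps
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

    # e.g. cosine_lr(1000 - 14) ~ 4.329e-05, matching the step=1000 line below.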
[2023-04-14 08:13:05,646] [INFO] [logging.py:96:log_dist] [Rank 0] step=640, skipped=14, lr=[4.721877334157592e-05, 4.721877334157592e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:05,655] [INFO] [timer.py:199:stop] epoch=0/micro_step=640/global_step=640, RunningAvgSamplesPerSec=232.32530877551514, CurrSamplesPerSec=238.77781830020476, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:08,340] [INFO] [logging.py:96:log_dist] [Rank 0] step=650, skipped=14, lr=[4.713095969542457e-05, 4.713095969542457e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:08,349] [INFO] [timer.py:199:stop] epoch=0/micro_step=650/global_step=650, RunningAvgSamplesPerSec=232.41156109492005, CurrSamplesPerSec=239.27245111339695, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:11,030] [INFO] [logging.py:96:log_dist] [Rank 0] step=660, skipped=14, lr=[4.704186549557359e-05, 4.704186549557359e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:11,039] [INFO] [timer.py:199:stop] epoch=0/micro_step=660/global_step=660, RunningAvgSamplesPerSec=232.4999508826735, CurrSamplesPerSec=236.7272598999598, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:13,729] [INFO] [logging.py:96:log_dist] [Rank 0] step=670, skipped=14, lr=[4.6951495897240316e-05, 4.6951495897240316e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:13,738] [INFO] [timer.py:199:stop] epoch=0/micro_step=670/global_step=670, RunningAvgSamplesPerSec=232.57533044833036, CurrSamplesPerSec=238.49481671020132, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:16,414] [INFO] [logging.py:96:log_dist] [Rank 0] step=680, skipped=14, lr=[4.685985612943988e-05, 4.685985612943988e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:16,576] [INFO] [timer.py:199:stop] epoch=0/micro_step=680/global_step=680, RunningAvgSamplesPerSec=232.47497856300302, CurrSamplesPerSec=151.90991641530897, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:19,257] [INFO] [logging.py:96:log_dist] [Rank 0] step=690, skipped=14, lr=[4.67669514946827e-05, 4.67669514946827e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:19,266] [INFO] [timer.py:199:stop] epoch=0/micro_step=690/global_step=690, RunningAvgSamplesPerSec=232.56006178079346, CurrSamplesPerSec=238.15098588050552, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:22,018] [INFO] [logging.py:96:log_dist] [Rank 0] step=700, skipped=14, lr=[4.6672787368667556e-05, 4.6672787368667556e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:22,027] [INFO] [timer.py:199:stop] epoch=0/micro_step=700/global_step=700, RunningAvgSamplesPerSec=232.56202209273536, CurrSamplesPerSec=239.73145050520927, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:24,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=710, skipped=14, lr=[4.657736919997064e-05, 4.657736919997064e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:24,720] [INFO] [timer.py:199:stop] epoch=0/micro_step=710/global_step=710, RunningAvgSamplesPerSec=232.6391010678468, CurrSamplesPerSec=238.18395061618853, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:27,131] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:13:27,132] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 128.0 to 256.0
[2023-04-14 08:13:28,009] [INFO] [logging.py:96:log_dist] [Rank 0] step=720, skipped=14, lr=[4.648070250973027e-05, 4.648070250973027e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:28,018] [INFO] [timer.py:199:stop] epoch=0/micro_step=720/global_step=720, RunningAvgSamplesPerSec=232.0038042639153, CurrSamplesPerSec=73.89810456026525, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:30,698] [INFO] [logging.py:96:log_dist] [Rank 0] step=730, skipped=14, lr=[4.638279289132733e-05, 4.638279289132733e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:30,707] [INFO] [timer.py:199:stop] epoch=0/micro_step=730/global_step=730, RunningAvgSamplesPerSec=232.09080849445348, CurrSamplesPerSec=239.1439137271322, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:33,389] [INFO] [logging.py:96:log_dist] [Rank 0] step=740, skipped=14, lr=[4.6283646010061766e-05, 4.6283646010061766e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:33,398] [INFO] [timer.py:199:stop] epoch=0/micro_step=740/global_step=740, RunningAvgSamplesPerSec=232.17384083699017, CurrSamplesPerSec=238.71942344957364, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:36,080] [INFO] [logging.py:96:log_dist] [Rank 0] step=750, skipped=14, lr=[4.618326760282465e-05, 4.618326760282465e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:36,089] [INFO] [timer.py:199:stop] epoch=0/micro_step=750/global_step=750, RunningAvgSamplesPerSec=232.2532210127636, CurrSamplesPerSec=238.51261710808774, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:38,769] [INFO] [logging.py:96:log_dist] [Rank 0] step=760, skipped=14, lr=[4.6081663477766334e-05, 4.6081663477766334e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:38,778] [INFO] [timer.py:199:stop] epoch=0/micro_step=760/global_step=760, RunningAvgSamplesPerSec=232.33376961430736, CurrSamplesPerSec=239.02828424580687, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:41,456] [INFO] [logging.py:96:log_dist] [Rank 0] step=770, skipped=14, lr=[4.597883951396027e-05, 4.597883951396027e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:41,465] [INFO] [timer.py:199:stop] epoch=0/micro_step=770/global_step=770, RunningAvgSamplesPerSec=232.41356734540975, CurrSamplesPerSec=238.61183941517598, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:44,142] [INFO] [logging.py:96:log_dist] [Rank 0] step=780, skipped=14, lr=[4.587480166106294e-05, 4.587480166106294e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:44,151] [INFO] [timer.py:199:stop] epoch=0/micro_step=780/global_step=780, RunningAvgSamplesPerSec=232.49307052700033, CurrSamplesPerSec=238.74596302238794, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:46,832] [INFO] [logging.py:96:log_dist] [Rank 0] step=790, skipped=14, lr=[4.57695559389695e-05, 4.57695559389695e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:46,841] [INFO] [timer.py:199:stop] epoch=0/micro_step=790/global_step=790, RunningAvgSamplesPerSec=232.5663569017472, CurrSamplesPerSec=238.75912883996534, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:49,518] [INFO] [logging.py:96:log_dist] [Rank 0] step=800, skipped=14, lr=[4.566310843746551e-05, 4.566310843746551e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:49,573] [INFO] [timer.py:199:stop] epoch=0/micro_step=800/global_step=800, RunningAvgSamplesPerSec=232.59320257452572, CurrSamplesPerSec=204.44685997696848, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:52,251] [INFO] [logging.py:96:log_dist] [Rank 0] step=810, skipped=14, lr=[4.5555465315874556e-05, 4.5555465315874556e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:52,260] [INFO] [timer.py:199:stop] epoch=0/micro_step=810/global_step=810, RunningAvgSamplesPerSec=232.66620385694617, CurrSamplesPerSec=238.08719545671846, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:54,675] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:13:54,675] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 256.0 to 512.0
[2023-04-14 08:13:54,956] [INFO] [logging.py:96:log_dist] [Rank 0] step=820, skipped=14, lr=[4.5446632802701845e-05, 4.5446632802701845e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:54,965] [INFO] [timer.py:199:stop] epoch=0/micro_step=820/global_step=820, RunningAvgSamplesPerSec=232.71992141064632, CurrSamplesPerSec=237.86483016266422, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:13:57,668] [INFO] [logging.py:96:log_dist] [Rank 0] step=830, skipped=14, lr=[4.533661719527379e-05, 4.533661719527379e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:13:57,677] [INFO] [timer.py:199:stop] epoch=0/micro_step=830/global_step=830, RunningAvgSamplesPerSec=232.76465056028493, CurrSamplesPerSec=238.6742136083237, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:00,391] [INFO] [logging.py:96:log_dist] [Rank 0] step=840, skipped=14, lr=[4.522542485937369e-05, 4.522542485937369e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:00,943] [INFO] [timer.py:199:stop] epoch=0/micro_step=840/global_step=840, RunningAvgSamplesPerSec=232.24883415915974, CurrSamplesPerSec=78.89081237399415, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:03,622] [INFO] [logging.py:96:log_dist] [Rank 0] step=850, skipped=14, lr=[4.5113062228873306e-05, 4.5113062228873306e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:03,631] [INFO] [timer.py:199:stop] epoch=0/micro_step=850/global_step=850, RunningAvgSamplesPerSec=232.32149517800812, CurrSamplesPerSec=238.7092338214488, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:06,310] [INFO] [logging.py:96:log_dist] [Rank 0] step=860, skipped=14, lr=[4.499953580536065e-05, 4.499953580536065e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:06,318] [INFO] [timer.py:199:stop] epoch=0/micro_step=860/global_step=860, RunningAvgSamplesPerSec=232.39312142674478, CurrSamplesPerSec=238.7476617559688, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:09,001] [INFO] [logging.py:96:log_dist] [Rank 0] step=870, skipped=14, lr=[4.488485215776377e-05, 4.488485215776377e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:09,010] [INFO] [timer.py:199:stop] epoch=0/micro_step=870/global_step=870, RunningAvgSamplesPerSec=232.45892085185358, CurrSamplesPerSec=239.0474416218883, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:11,687] [INFO] [logging.py:96:log_dist] [Rank 0] step=880, skipped=14, lr=[4.4769017921970634e-05, 4.4769017921970634e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:12,235] [INFO] [timer.py:199:stop] epoch=0/micro_step=880/global_step=880, RunningAvgSamplesPerSec=232.01116150831726, CurrSamplesPerSec=79.36282082780099, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:14,913] [INFO] [logging.py:96:log_dist] [Rank 0] step=890, skipped=14, lr=[4.465203980044517e-05, 4.465203980044517e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:14,922] [INFO] [timer.py:199:stop] epoch=0/micro_step=890/global_step=890, RunningAvgSamplesPerSec=232.08395122542294, CurrSamplesPerSec=238.8366667289482, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:17,598] [INFO] [logging.py:96:log_dist] [Rank 0] step=900, skipped=14, lr=[4.453392456183946e-05, 4.453392456183946e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:17,607] [INFO] [timer.py:199:stop] epoch=0/micro_step=900/global_step=900, RunningAvgSamplesPerSec=232.1565495204714, CurrSamplesPerSec=238.27930386185176, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
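CurrSamplesPerSec dips to 73.9, 78.9 and 79.4 at micro-steps 720, 840 and 880 above while RunningAvgSamplesPerSec barely moves, i.e. these are isolated slow steps. The two fields are simple wall-clock bookkeeping: samples in the step divided by its duration, and total samples over total time. A sketch; batch_size=64 is inferred from ~238 samples/s at ~0.27 s per step in this excerpt, not read from the configuration:

    import time

    class ThroughputTimer:
        # Sketch of the SamplesPerSec bookkeeping in the timer.py lines.
        def __init__(self, batch_size=64):
            self.batch_size = batch_size
            self.total_samples = 0
            self.total_time = 0.0
            self._t0 = None

        def start(self):
            self._t0 = time.time()

        def stop(self):
            elapsed = time.time() - self._t0
            self.total_samples += self.batch_size
            self.total_time += elapsed
            curr = self.batch_size / elapsed              # CurrSamplesPerSec
            avg = self.total_samples / self.total_time    # RunningAvgSamplesPerSec
            return curr, avg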
[2023-04-14 08:14:20,282] [INFO] [logging.py:96:log_dist] [Rank 0] step=910, skipped=14, lr=[4.4414679040602066e-05, 4.4414679040602066e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:20,291] [INFO] [timer.py:199:stop] epoch=0/micro_step=910/global_step=910, RunningAvgSamplesPerSec=232.22854254787003, CurrSamplesPerSec=238.9650866846193, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:22,696] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:14:22,696] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 512.0 to 1024.0
[2023-04-14 08:14:22,976] [INFO] [logging.py:96:log_dist] [Rank 0] step=920, skipped=14, lr=[4.4294310136582593e-05, 4.4294310136582593e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:22,985] [INFO] [timer.py:199:stop] epoch=0/micro_step=920/global_step=920, RunningAvgSamplesPerSec=232.29054580143293, CurrSamplesPerSec=238.59020700566177, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:25,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=930, skipped=14, lr=[4.417282481463243e-05, 4.417282481463243e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:25,679] [INFO] [timer.py:199:stop] epoch=0/micro_step=930/global_step=930, RunningAvgSamplesPerSec=232.3516410002271, CurrSamplesPerSec=237.90678192927197, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:28,370] [INFO] [logging.py:96:log_dist] [Rank 0] step=940, skipped=14, lr=[4.405023010420174e-05, 4.405023010420174e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:28,379] [INFO] [timer.py:199:stop] epoch=0/micro_step=940/global_step=940, RunningAvgSamplesPerSec=232.4054101358002, CurrSamplesPerSec=239.46540966905684, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:31,087] [INFO] [logging.py:96:log_dist] [Rank 0] step=950, skipped=14, lr=[4.3926533098932754e-05, 4.3926533098932754e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:31,096] [INFO] [timer.py:199:stop] epoch=0/micro_step=950/global_step=950, RunningAvgSamplesPerSec=232.4430686884421, CurrSamplesPerSec=238.61735418618585, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:33,777] [INFO] [logging.py:96:log_dist] [Rank 0] step=960, skipped=14, lr=[4.380174095624927e-05, 4.380174095624927e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:33,785] [INFO] [timer.py:199:stop] epoch=0/micro_step=960/global_step=960, RunningAvgSamplesPerSec=232.50360354358332, CurrSamplesPerSec=239.38468782438406, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:36,467] [INFO] [logging.py:96:log_dist] [Rank 0] step=970, skipped=14, lr=[4.3675860896942524e-05, 4.3675860896942524e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:36,960] [INFO] [timer.py:199:stop] epoch=0/micro_step=970/global_step=970, RunningAvgSamplesPerSec=232.14066869561137, CurrSamplesPerSec=85.16953161041165, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:39,643] [INFO] [logging.py:96:log_dist] [Rank 0] step=980, skipped=14, lr=[4.35489002047534e-05, 4.35489002047534e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:39,652] [INFO] [timer.py:199:stop] epoch=0/micro_step=980/global_step=980, RunningAvgSamplesPerSec=232.20148216948098, CurrSamplesPerSec=239.60370283425166, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:42,329] [INFO] [logging.py:96:log_dist] [Rank 0] step=990, skipped=14, lr=[4.342086622595094e-05, 4.342086622595094e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:42,337] [INFO] [timer.py:199:stop] epoch=0/micro_step=990/global_step=990, RunningAvgSamplesPerSec=232.26603442994033, CurrSamplesPerSec=238.4717225351775, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:45,013] [INFO] [logging.py:96:log_dist] [Rank 0] step=1000, skipped=14, lr=[4.3291766368907275e-05, 4.3291766368907275e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:45,022] [INFO] [timer.py:199:stop] epoch=0/micro_step=1000/global_step=1000, RunningAvgSamplesPerSec=232.33050999163882, CurrSamplesPerSec=238.9276520448135, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:47,700] [INFO] [logging.py:96:log_dist] [Rank 0] step=1010, skipped=14, lr=[4.3161608103668985e-05, 4.3161608103668985e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:48,408] [INFO] [timer.py:199:stop] epoch=0/micro_step=1010/global_step=1010, RunningAvgSamplesPerSec=231.8075758889048, CurrSamplesPerSec=66.13973499800794, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:50,808] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:14:50,808] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 1024.0 to 2048.0
[2023-04-14 08:14:51,088] [INFO] [logging.py:96:log_dist] [Rank 0] step=1020, skipped=14, lr=[4.303039896152482e-05, 4.303039896152482e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:51,097] [INFO] [timer.py:199:stop] epoch=0/micro_step=1020/global_step=1020, RunningAvgSamplesPerSec=231.87126547538696, CurrSamplesPerSec=238.9674267390952, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:53,775] [INFO] [logging.py:96:log_dist] [Rank 0] step=1030, skipped=14, lr=[4.2898146534569975e-05, 4.2898146534569975e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:53,784] [INFO] [timer.py:199:stop] epoch=0/micro_step=1030/global_step=1030, RunningAvgSamplesPerSec=231.9353262305565, CurrSamplesPerSec=238.45498735040792, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:56,470] [INFO] [logging.py:96:log_dist] [Rank 0] step=1040, skipped=14, lr=[4.276485847526673e-05, 4.276485847526673e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:56,510] [INFO] [timer.py:199:stop] epoch=0/micro_step=1040/global_step=1040, RunningAvgSamplesPerSec=231.966548378819, CurrSamplesPerSec=214.28773189152597, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:14:59,196] [INFO] [logging.py:96:log_dist] [Rank 0] step=1050, skipped=14, lr=[4.263054249600172e-05, 4.263054249600172e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:14:59,204] [INFO] [timer.py:199:stop] epoch=0/micro_step=1050/global_step=1050, RunningAvgSamplesPerSec=232.02244255581908, CurrSamplesPerSec=238.22855205635793, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:01,931] [INFO] [logging.py:96:log_dist] [Rank 0] step=1060, skipped=14, lr=[4.249520636863962e-05, 4.249520636863962e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:01,940] [INFO] [timer.py:199:stop] epoch=0/micro_step=1060/global_step=1060, RunningAvgSamplesPerSec=232.04457661840087, CurrSamplesPerSec=238.6105668119119, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:04,617] [INFO] [logging.py:96:log_dist] [Rank 0] step=1070, skipped=14, lr=[4.2358857924073495e-05, 4.2358857924073495e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:04,626] [INFO] [timer.py:199:stop] epoch=0/micro_step=1070/global_step=1070, RunningAvgSamplesPerSec=232.10539505612394, CurrSamplesPerSec=238.8634449276252, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:07,383] [INFO] [logging.py:96:log_dist] [Rank 0] step=1080, skipped=14, lr=[4.2221505051771646e-05, 4.2221505051771646e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
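MemAllocated and MaxMemAllocated hold at 4.32GB and 21.24GB for every step shown, so the working set is stable and sits well below the peak reached during the earliest steps. The two fields map directly onto PyTorch's per-device allocator counters; a minimal way to produce the same readout:

    import torch

    def memory_report():
        # Mirrors the MemAllocated / MaxMemAllocated fields in the timer lines.
        gb = 1024 ** 3
        alloc = torch.cuda.memory_allocated() / gb       # currently allocated
        peak = torch.cuda.max_memory_allocated() / gb    # high-water mark
        return f"MemAllocated={alloc:.2f}GB, MaxMemAllocated={peak:.2f}GB"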
[2023-04-14 08:15:07,392] [INFO] [timer.py:199:stop] epoch=0/micro_step=1080/global_step=1080, RunningAvgSamplesPerSec=232.10314852207046, CurrSamplesPerSec=239.04041689411392, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:10,073] [INFO] [logging.py:96:log_dist] [Rank 0] step=1090, skipped=14, lr=[4.208315569932113e-05, 4.208315569932113e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:10,832] [INFO] [timer.py:199:stop] epoch=0/micro_step=1090/global_step=1090, RunningAvgSamplesPerSec=231.580266665431, CurrSamplesPerSec=62.88201364528125, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:13,512] [INFO] [logging.py:96:log_dist] [Rank 0] step=1100, skipped=14, lr=[4.1943817871967886e-05, 4.1943817871967886e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:13,521] [INFO] [timer.py:199:stop] epoch=0/micro_step=1100/global_step=1100, RunningAvgSamplesPerSec=231.64135553714866, CurrSamplesPerSec=238.98593605744875, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:16,208] [INFO] [logging.py:96:log_dist] [Rank 0] step=1110, skipped=14, lr=[4.180349963215352e-05, 4.180349963215352e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:16,217] [INFO] [timer.py:199:stop] epoch=0/micro_step=1110/global_step=1110, RunningAvgSamplesPerSec=231.69550734632847, CurrSamplesPerSec=236.28800090136718, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:18,613] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:15:18,613] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0
[2023-04-14 08:15:18,893] [INFO] [logging.py:96:log_dist] [Rank 0] step=1120, skipped=14, lr=[4.16622090990488e-05, 4.16622090990488e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:18,902] [INFO] [timer.py:199:stop] epoch=0/micro_step=1120/global_step=1120, RunningAvgSamplesPerSec=231.75759499449538, CurrSamplesPerSec=238.75339516027896, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:21,580] [INFO] [logging.py:96:log_dist] [Rank 0] step=1130, skipped=14, lr=[4.151995444808387e-05, 4.151995444808387e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:21,589] [INFO] [timer.py:199:stop] epoch=0/micro_step=1130/global_step=1130, RunningAvgSamplesPerSec=231.81656486696875, CurrSamplesPerSec=236.022280233283, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:24,268] [INFO] [logging.py:96:log_dist] [Rank 0] step=1140, skipped=14, lr=[4.1376743910475184e-05, 4.1376743910475184e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:24,277] [INFO] [timer.py:199:stop] epoch=0/micro_step=1140/global_step=1140, RunningAvgSamplesPerSec=231.87433117130746, CurrSamplesPerSec=238.60038647446052, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:26,952] [INFO] [logging.py:96:log_dist] [Rank 0] step=1150, skipped=14, lr=[4.123258577274923e-05, 4.123258577274923e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:26,961] [INFO] [timer.py:199:stop] epoch=0/micro_step=1150/global_step=1150, RunningAvgSamplesPerSec=231.93346765078377, CurrSamplesPerSec=238.4335951552145, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:29,644] [INFO] [logging.py:96:log_dist] [Rank 0] step=1160, skipped=14, lr=[4.1087488376263056e-05, 4.1087488376263056e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:29,653] [INFO] [timer.py:199:stop] epoch=0/micro_step=1160/global_step=1160, RunningAvgSamplesPerSec=231.986215017516, CurrSamplesPerSec=238.59678114658877, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:32,337] [INFO] [logging.py:96:log_dist] [Rank 0] step=1170, skipped=14, lr=[4.0941460116721606e-05, 4.0941460116721606e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:32,346] [INFO] [timer.py:199:stop] epoch=0/micro_step=1170/global_step=1170, RunningAvgSamplesPerSec=232.03776091834138, CurrSamplesPerSec=238.88448915372732, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:35,031] [INFO] [logging.py:96:log_dist] [Rank 0] step=1180, skipped=14, lr=[4.079450944369195e-05, 4.079450944369195e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:35,087] [INFO] [timer.py:199:stop] epoch=0/micro_step=1180/global_step=1180, RunningAvgSamplesPerSec=232.05385962590864, CurrSamplesPerSec=202.66545063324588, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:37,769] [INFO] [logging.py:96:log_dist] [Rank 0] step=1190, skipped=14, lr=[4.064664486011433e-05, 4.064664486011433e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:37,778] [INFO] [timer.py:199:stop] epoch=0/micro_step=1190/global_step=1190, RunningAvgSamplesPerSec=232.10478875385354, CurrSamplesPerSec=238.86089436428594, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:40,496] [INFO] [logging.py:96:log_dist] [Rank 0] step=1200, skipped=14, lr=[4.0497874921810194e-05, 4.0497874921810194e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:40,505] [INFO] [timer.py:199:stop] epoch=0/micro_step=1200/global_step=1200, RunningAvgSamplesPerSec=232.1297612518395, CurrSamplesPerSec=238.84707975940492, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:43,187] [INFO] [logging.py:96:log_dist] [Rank 0] step=1210, skipped=14, lr=[4.0348208236987116e-05, 4.0348208236987116e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:43,196] [INFO] [timer.py:199:stop] epoch=0/micro_step=1210/global_step=1210, RunningAvgSamplesPerSec=232.17931929931834, CurrSamplesPerSec=237.87684498575055, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:45,597] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:15:45,597] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0
[2023-04-14 08:15:45,877] [INFO] [logging.py:96:log_dist] [Rank 0] step=1220, skipped=14, lr=[4.0197653465740715e-05, 4.0197653465740715e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:45,886] [INFO] [timer.py:199:stop] epoch=0/micro_step=1220/global_step=1220, RunningAvgSamplesPerSec=232.22912042897087, CurrSamplesPerSec=238.46176587640002, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:48,562] [INFO] [logging.py:96:log_dist] [Rank 0] step=1230, skipped=14, lr=[4.0046219319553535e-05, 4.0046219319553535e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:48,571] [INFO] [timer.py:199:stop] epoch=0/micro_step=1230/global_step=1230, RunningAvgSamplesPerSec=232.28082966759314, CurrSamplesPerSec=238.78695171429766, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:51,247] [INFO] [logging.py:96:log_dist] [Rank 0] step=1240, skipped=14, lr=[3.989391456079101e-05, 3.989391456079101e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:51,256] [INFO] [timer.py:199:stop] epoch=0/micro_step=1240/global_step=1240, RunningAvgSamplesPerSec=232.33238920515356, CurrSamplesPerSec=238.66593760474709, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:53,933] [INFO] [logging.py:96:log_dist] [Rank 0] step=1250, skipped=14, lr=[3.974074800219444e-05, 3.974074800219444e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:53,942] [INFO] [timer.py:199:stop] epoch=0/micro_step=1250/global_step=1250, RunningAvgSamplesPerSec=232.38230208443062, CurrSamplesPerSec=239.18567855516716, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
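The log_dist and timer records are regular enough to mine with a short parser, e.g. to plot the lr decay or locate the throughput dips. A sketch whose regexes are written against the exact line format shown above (the helper names are illustrative):

    import re

    STEP_RE = re.compile(r"step=(\d+), skipped=(\d+), lr=\[([0-9.e-]+)")
    TIMER_RE = re.compile(
        r"global_step=(\d+), RunningAvgSamplesPerSec=([0-9.]+), "
        r"CurrSamplesPerSec=([0-9.]+)")

    def parse(path):
        # Returns (step, skipped, lr) and (global_step, avg, curr) series.
        steps, timers = [], []
        with open(path) as f:
            for line in f:
                if m := STEP_RE.search(line):
                    steps.append((int(m[1]), int(m[2]), float(m[3])))
                if m := TIMER_RE.search(line):
                    timers.append((int(m[1]), float(m[2]), float(m[3])))
        return steps, timers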
[2023-04-14 08:15:56,617] [INFO] [logging.py:96:log_dist] [Rank 0] step=1260, skipped=14, lr=[3.9586728506371036e-05, 3.9586728506371036e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:56,892] [INFO] [timer.py:199:stop] epoch=0/micro_step=1260/global_step=1260, RunningAvgSamplesPerSec=232.253831932198, CurrSamplesPerSec=119.81377578233551, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:15:59,573] [INFO] [logging.py:96:log_dist] [Rank 0] step=1270, skipped=14, lr=[3.943186498528115e-05, 3.943186498528115e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:15:59,581] [INFO] [timer.py:199:stop] epoch=0/micro_step=1270/global_step=1270, RunningAvgSamplesPerSec=232.30158763588574, CurrSamplesPerSec=238.2596349712555, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:16:02,261] [INFO] [logging.py:96:log_dist] [Rank 0] step=1280, skipped=14, lr=[3.927616639972257e-05, 3.927616639972257e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:16:02,270] [INFO] [timer.py:199:stop] epoch=0/micro_step=1280/global_step=1280, RunningAvgSamplesPerSec=232.34872170964078, CurrSamplesPerSec=238.48379874768122, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:16:04,953] [INFO] [logging.py:96:log_dist] [Rank 0] step=1290, skipped=14, lr=[3.911964175881206e-05, 3.911964175881206e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:16:04,962] [INFO] [timer.py:199:stop] epoch=0/micro_step=1290/global_step=1290, RunningAvgSamplesPerSec=232.39329310836865, CurrSamplesPerSec=238.63835507079528, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:16:07,702] [INFO] [logging.py:96:log_dist] [Rank 0] step=1300, skipped=14, lr=[3.8962300119464034e-05, 3.8962300119464034e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:16:07,710] [INFO] [timer.py:199:stop] epoch=0/micro_step=1300/global_step=1300, RunningAvgSamplesPerSec=232.40183895517626, CurrSamplesPerSec=237.45116573653655, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:16:10,410] [INFO] [logging.py:96:log_dist] [Rank 0] step=1310, skipped=14, lr=[3.8804150585866527e-05, 3.8804150585866527e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:16:10,418] [INFO] [timer.py:199:stop] epoch=0/micro_step=1310/global_step=1310, RunningAvgSamplesPerSec=232.4360925161428, CurrSamplesPerSec=234.99537862612394, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:16:12,873] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:16:12,873] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-04-14 08:16:13,154] [INFO] [logging.py:96:log_dist] [Rank 0] step=1320, skipped=14, lr=[3.8645202308954386e-05, 3.8645202308954386e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:16:13,163] [INFO] [timer.py:199:stop] epoch=0/micro_step=1320/global_step=1320, RunningAvgSamplesPerSec=232.4485806643127, CurrSamplesPerSec=238.08508377554486, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:16:15,854] [INFO] [logging.py:96:log_dist] [Rank 0] step=1330, skipped=14, lr=[3.848546448587979e-05, 3.848546448587979e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:16:15,863] [INFO] [timer.py:199:stop] epoch=0/micro_step=1330/global_step=1330, RunningAvgSamplesPerSec=232.48670517765638, CurrSamplesPerSec=237.89244498799175, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:16:18,560] [INFO] [logging.py:96:log_dist] [Rank 0] step=1340, skipped=14, lr=[3.832494635948005e-05, 3.832494635948005e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:16:19,146] [INFO] [timer.py:199:stop] epoch=0/micro_step=1340/global_step=1340, RunningAvgSamplesPerSec=232.15677636043918, CurrSamplesPerSec=75.62603862898298, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:16:21,845] [INFO] [logging.py:96:log_dist] [Rank 0] step=1350, skipped=14, lr=[3.816365721774284e-05, 3.816365721774284e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:16:21,853] [INFO] [timer.py:199:stop] epoch=0/micro_step=1350/global_step=1350, RunningAvgSamplesPerSec=232.19224216351364, CurrSamplesPerSec=237.29101082872927, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:16:24,547] [INFO] [logging.py:96:log_dist] [Rank 0] step=1360, skipped=14, lr=[3.800160639326873e-05, 3.800160639326873e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:16:24,555] [INFO] [timer.py:199:stop] epoch=0/micro_step=1360/global_step=1360, RunningAvgSamplesPerSec=232.2304375112837, CurrSamplesPerSec=238.0975432359604, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:16:27,247] [INFO] [logging.py:96:log_dist] [Rank 0] step=1370, skipped=14, lr=[3.7838803262731205e-05, 3.7838803262731205e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:16:27,255] [INFO] [timer.py:199:stop] epoch=0/micro_step=1370/global_step=1370, RunningAvgSamplesPerSec=232.2689915673385, CurrSamplesPerSec=238.2697862512349, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:16:29,942] [INFO] [logging.py:96:log_dist] [Rank 0] step=1380, skipped=14, lr=[3.7675257246334085e-05, 3.7675257246334085e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:16:29,951] [INFO] [timer.py:199:stop] epoch=0/micro_step=1380/global_step=1380, RunningAvgSamplesPerSec=232.3095909917431, CurrSamplesPerSec=237.84586186028667, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:16:32,649] [INFO] [logging.py:96:log_dist] [Rank 0] step=1390, skipped=14, lr=[3.7510977807266456e-05, 3.7510977807266456e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:16:32,658] [INFO] [timer.py:199:stop] epoch=0/micro_step=1390/global_step=1390, RunningAvgSamplesPerSec=232.3431151280274, CurrSamplesPerSec=237.30212863011482, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:16:35,352] [INFO] [logging.py:96:log_dist] [Rank 0] step=1400, skipped=14, lr=[3.7345974451155105e-05, 3.7345974451155105e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:16:35,361] [INFO] [timer.py:199:stop] epoch=0/micro_step=1400/global_step=1400, RunningAvgSamplesPerSec=232.37836422379345, CurrSamplesPerSec=237.86440861196448, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:16:38,063] [INFO] [logging.py:96:log_dist] [Rank 0] step=1410, skipped=14, lr=[3.718025672551453e-05, 3.718025672551453e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:16:38,072] [INFO] [timer.py:199:stop] epoch=0/micro_step=1410/global_step=1410, RunningAvgSamplesPerSec=232.40831980797128, CurrSamplesPerSec=237.1381942597875, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:16:40,518] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:16:40,653] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-04-14 08:16:40,934] [INFO] [logging.py:96:log_dist] [Rank 0] step=1420, skipped=14, lr=[3.701383421919445e-05, 3.701383421919445e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:16:40,943] [INFO] [timer.py:199:stop] epoch=0/micro_step=1420/global_step=1420, RunningAvgSamplesPerSec=232.34292662348903, CurrSamplesPerSec=237.97237438287394, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:16:43,369] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1428
[2023-04-14 08:16:43,369] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on
iteration 1428 [2023-04-14 08:16:43,370] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-14 08:16:43,370] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-14 08:16:43,370] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1428 [2023-04-14 08:16:43,370] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1428 [2023-04-14 08:16:43,370] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-14 08:16:43,370] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1428 [2023-04-14 08:16:43,370] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1428 [2023-04-14 08:16:43,370] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-14 08:16:43,370] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-14 08:16:43,370] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-14 08:16:43,370] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1428 [2023-04-14 08:16:43,370] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-14 08:16:43,370] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1428 [2023-04-14 08:16:43,370] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0 [2023-04-14 08:16:43,370] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-14 08:16:43,631] [INFO] [logging.py:96:log_dist] [Rank 0] step=1430, skipped=15, lr=[3.686345933408103e-05, 3.686345933408103e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:16:43,640] [INFO] [timer.py:199:stop] epoch=0/micro_step=1430/global_step=1430, RunningAvgSamplesPerSec=232.3812368109915, CurrSamplesPerSec=238.81520599983986, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:16:46,326] [INFO] [logging.py:96:log_dist] [Rank 0] step=1440, skipped=15, lr=[3.66957243073569e-05, 3.66957243073569e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:16:46,334] [INFO] [timer.py:199:stop] epoch=0/micro_step=1440/global_step=1440, RunningAvgSamplesPerSec=232.4196469054076, CurrSamplesPerSec=238.3381184075008, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:16:49,023] [INFO] [logging.py:96:log_dist] [Rank 0] step=1450, skipped=15, lr=[3.652731253623315e-05, 3.652731253623315e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:16:49,031] [INFO] [timer.py:199:stop] epoch=0/micro_step=1450/global_step=1450, RunningAvgSamplesPerSec=232.45597146585658, CurrSamplesPerSec=235.72423906867854, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:16:51,716] [INFO] [logging.py:96:log_dist] [Rank 0] step=1460, skipped=15, lr=[3.635823376544385e-05, 3.635823376544385e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:16:51,724] [INFO] [timer.py:199:stop] epoch=0/micro_step=1460/global_step=1460, RunningAvgSamplesPerSec=232.4941230261541, CurrSamplesPerSec=239.13730940138046, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:16:54,418] [INFO] [logging.py:96:log_dist] [Rank 0] step=1470, skipped=15, lr=[3.618849777831736e-05, 3.618849777831736e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:16:54,427] 
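The fused_optimizer lines above show DeepSpeed's dynamic fp16 loss scaling at work: after 100 consecutive overflow-free iterations the scale is doubled (messages 370/371), and a gradient overflow halves it and skips that optimizer step, which is why the skipped= counter ticks from 14 to 15 right after the overflow on iteration 1428. A minimal sketch of that policy, assuming a standard doubling/halving scheme with a 100-step window (a hypothetical standalone class, not DeepSpeed's actual fused_optimizer code):

# Sketch of the dynamic loss-scale policy visible in this log.
# Hypothetical simplification; DeepSpeed's real logic lives in
# fused_optimizer.py and tracks more state than this.
class DynamicLossScaler:
    def __init__(self, init_scale=4096.0, scale_factor=2.0, scale_window=100):
        self.scale = init_scale          # current loss scale
        self.scale_factor = scale_factor # double / halve on adjustment
        self.scale_window = scale_window # "No Grad overflow for 100 iterations"
        self.clean_steps = 0             # iterations since the last overflow

    def update(self, has_overflow: bool) -> bool:
        """Return True if the optimizer step should be skipped."""
        if has_overflow:
            # "Grad overflow on iteration N" -> halve the scale, skip this step
            self.scale /= self.scale_factor
            self.clean_steps = 0
            return True
        self.clean_steps += 1
        if self.clean_steps >= self.scale_window:
            # "Increasing dynamic loss scale from X to 2X"
            self.scale *= self.scale_factor
            self.clean_steps = 0
        return False

With scale_window=100 this reproduces the cadence seen here: a doubling roughly every 100 clean steps, and on each overflow an immediate halving plus one skipped optimizer step.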
[2023-04-14 08:16:57,110] [INFO] [logging.py:96:log_dist] [Rank 0] step=1480, skipped=15, lr=[3.601811439621024e-05, 3.601811439621024e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:16:57,119] [INFO] [timer.py:199:stop] epoch=0/micro_step=1480/global_step=1480, RunningAvgSamplesPerSec=232.56384535139267, CurrSamplesPerSec=238.25646287364668, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:16:59,800] [INFO] [logging.py:96:log_dist] [Rank 0] step=1490, skipped=15, lr=[3.5847093477938956e-05, 3.5847093477938956e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:16:59,808] [INFO] [timer.py:199:stop] epoch=0/micro_step=1490/global_step=1490, RunningAvgSamplesPerSec=232.6027064393841, CurrSamplesPerSec=238.13535195286892, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:02,493] [INFO] [logging.py:96:log_dist] [Rank 0] step=1500, skipped=15, lr=[3.5675444919209486e-05, 3.5675444919209486e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:02,501] [INFO] [timer.py:199:stop] epoch=0/micro_step=1500/global_step=1500, RunningAvgSamplesPerSec=232.63914456675244, CurrSamplesPerSec=238.11845271835213, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:05,185] [INFO] [logging.py:96:log_dist] [Rank 0] step=1510, skipped=15, lr=[3.550317865204465e-05, 3.550317865204465e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:05,559] [INFO] [timer.py:199:stop] epoch=0/micro_step=1510/global_step=1510, RunningAvgSamplesPerSec=232.47015228178003, CurrSamplesPerSec=100.94643814067985, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:08,250] [INFO] [logging.py:96:log_dist] [Rank 0] step=1520, skipped=15, lr=[3.533030464420946e-05, 3.533030464420946e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:08,259] [INFO] [timer.py:199:stop] epoch=0/micro_step=1520/global_step=1520, RunningAvgSamplesPerSec=232.50267792762364, CurrSamplesPerSec=232.42493164111542, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:10,965] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:17:10,966] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-04-14 08:17:10,977] [INFO] [logging.py:96:log_dist] [Rank 0] step=1530, skipped=15, lr=[3.515683289863435e-05, 3.515683289863435e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:10,986] [INFO] [timer.py:199:stop] epoch=0/micro_step=1530/global_step=1530, RunningAvgSamplesPerSec=232.52011467607667, CurrSamplesPerSec=237.6571867195157, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:11,768] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1532
[2023-04-14 08:17:11,768] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-04-14 08:17:11,769] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-04-14 08:17:13,642] [INFO] [logging.py:96:log_dist] [Rank 0] step=1540, skipped=16, lr=[3.500020555715168e-05, 3.500020555715168e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:13,651] [INFO] [timer.py:199:stop] epoch=0/micro_step=1540/global_step=1540, RunningAvgSamplesPerSec=232.57089658484665, CurrSamplesPerSec=238.5382628284022, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:16,351] [INFO] [logging.py:96:log_dist] [Rank 0] step=1550, skipped=16, lr=[3.4825625791348096e-05, 3.4825625791348096e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:16,360] [INFO] [timer.py:199:stop] epoch=0/micro_step=1550/global_step=1550, RunningAvgSamplesPerSec=232.59751136140997, CurrSamplesPerSec=237.58924439959995, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:19,042] [INFO] [logging.py:96:log_dist] [Rank 0] step=1560, skipped=16, lr=[3.4650477489808525e-05, 3.4650477489808525e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:19,051] [INFO] [timer.py:199:stop] epoch=0/micro_step=1560/global_step=1560, RunningAvgSamplesPerSec=232.63285756347932, CurrSamplesPerSec=238.4181358590712, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:21,729] [INFO] [logging.py:96:log_dist] [Rank 0] step=1570, skipped=16, lr=[3.447477078705983e-05, 3.447477078705983e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:21,738] [INFO] [timer.py:199:stop] epoch=0/micro_step=1570/global_step=1570, RunningAvgSamplesPerSec=232.67012354199923, CurrSamplesPerSec=238.21079403167323, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:24,418] [INFO] [logging.py:96:log_dist] [Rank 0] step=1580, skipped=16, lr=[3.429851584993941e-05, 3.429851584993941e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:24,427] [INFO] [timer.py:199:stop] epoch=0/micro_step=1580/global_step=1580, RunningAvgSamplesPerSec=232.7062495690848, CurrSamplesPerSec=238.62605107380872, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:27,111] [INFO] [logging.py:96:log_dist] [Rank 0] step=1590, skipped=16, lr=[3.412172287700685e-05, 3.412172287700685e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:27,120] [INFO] [timer.py:199:stop] epoch=0/micro_step=1590/global_step=1590, RunningAvgSamplesPerSec=232.73959855041318, CurrSamplesPerSec=239.2611479274646, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:29,798] [INFO] [logging.py:96:log_dist] [Rank 0] step=1600, skipped=16, lr=[3.394440209795392e-05, 3.394440209795392e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:29,806] [INFO] [timer.py:199:stop] epoch=0/micro_step=1600/global_step=1600, RunningAvgSamplesPerSec=232.7758028617994, CurrSamplesPerSec=238.4075484567726, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:32,486] [INFO] [logging.py:96:log_dist] [Rank 0] step=1610, skipped=16, lr=[3.3766563773012535e-05, 3.3766563773012535e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:32,495] [INFO] [timer.py:199:stop] epoch=0/micro_step=1610/global_step=1610, RunningAvgSamplesPerSec=232.81052181445983, CurrSamplesPerSec=238.85643100950674, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:35,172] [INFO] [logging.py:96:log_dist] [Rank 0] step=1620, skipped=16, lr=[3.358821819236119e-05, 3.358821819236119e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:35,181] [INFO] [timer.py:199:stop] epoch=0/micro_step=1620/global_step=1620, RunningAvgSamplesPerSec=232.84589693066536, CurrSamplesPerSec=238.49884275815378, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:37,861] [INFO] [logging.py:96:log_dist] [Rank 0] step=1630, skipped=16, lr=[3.340937567552944e-05, 3.340937567552944e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:38,124] [INFO] [timer.py:199:stop] epoch=0/micro_step=1630/global_step=1630, RunningAvgSamplesPerSec=232.74760593917557, CurrSamplesPerSec=122.72552568643381, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:39,179] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:17:39,179] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-04-14 08:17:39,709] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1635
[2023-04-14 08:17:39,710] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-04-14 08:17:39,710] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-04-14 08:17:40,778] [INFO] [logging.py:96:log_dist] [Rank 0] step=1640, skipped=17, lr=[3.3248001082218514e-05, 3.3248001082218514e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:40,787] [INFO] [timer.py:199:stop] epoch=0/micro_step=1640/global_step=1640, RunningAvgSamplesPerSec=232.79513223539237, CurrSamplesPerSec=238.44778558878343, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:43,478] [INFO] [logging.py:96:log_dist] [Rank 0] step=1650, skipped=17, lr=[3.3068242919447926e-05, 3.3068242919447926e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:43,486] [INFO] [timer.py:199:stop] epoch=0/micro_step=1650/global_step=1650, RunningAvgSamplesPerSec=232.8234725736437, CurrSamplesPerSec=234.668121346908, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:46,172] [INFO] [logging.py:96:log_dist] [Rank 0] step=1660, skipped=17, lr=[3.2888017907590715e-05, 3.2888017907590715e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:46,181] [INFO] [timer.py:199:stop] epoch=0/micro_step=1660/global_step=1660, RunningAvgSamplesPerSec=232.85415774663363, CurrSamplesPerSec=239.13219662551, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:48,881] [INFO] [logging.py:96:log_dist] [Rank 0] step=1670, skipped=17, lr=[3.270733647492512e-05, 3.270733647492512e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:48,890] [INFO] [timer.py:199:stop] epoch=0/micro_step=1670/global_step=1670, RunningAvgSamplesPerSec=232.87683309542362, CurrSamplesPerSec=239.08363615635642, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:51,624] [INFO] [logging.py:96:log_dist] [Rank 0] step=1680, skipped=17, lr=[3.252620907613907e-05, 3.252620907613907e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:51,633] [INFO] [timer.py:199:stop] epoch=0/micro_step=1680/global_step=1680, RunningAvgSamplesPerSec=232.88227081285683, CurrSamplesPerSec=238.60081063784398, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:54,321] [INFO] [logging.py:96:log_dist] [Rank 0] step=1690, skipped=17, lr=[3.234464619172522e-05, 3.234464619172522e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:54,330] [INFO] [timer.py:199:stop] epoch=0/micro_step=1690/global_step=1690, RunningAvgSamplesPerSec=232.91089419969552, CurrSamplesPerSec=239.12303678213547, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
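Note the lr column: it falls smoothly from about 4.02e-05 at step 1220 toward 2.27e-05 by step 2210, consistent with a cosine decay from a 5e-05 peak, and the two identical entries inside lr=[...] are the run's two parameter groups (presumably the weight-decay and no-decay groups) sharing one schedule. A sketch of the half-cosine form these numbers follow (peak_lr and total_steps here are illustrative assumptions, not values read out of this log):

import math

# Hypothetical re-computation of the printed lr values, assuming a standard
# half-cosine decay with no warmup. total_steps is the full schedule length.
def cosine_lr(step: int, total_steps: int, peak_lr: float = 5e-5, min_lr: float = 0.0) -> float:
    progress = min(step / total_steps, 1.0)  # 0.0 at start, 1.0 at the end
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

Plugging in the run's actual total step count would reproduce the printed values to within rounding; the exact count is not shown in this excerpt.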
[2023-04-14 08:17:57,009] [INFO] [logging.py:96:log_dist] [Rank 0] step=1700, skipped=17, lr=[3.216265832737454e-05, 3.216265832737454e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:57,018] [INFO] [timer.py:199:stop] epoch=0/micro_step=1700/global_step=1700, RunningAvgSamplesPerSec=232.9432832575674, CurrSamplesPerSec=239.01083242068898, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:17:59,701] [INFO] [logging.py:96:log_dist] [Rank 0] step=1710, skipped=17, lr=[3.198025601336841e-05, 3.198025601336841e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:17:59,710] [INFO] [timer.py:199:stop] epoch=0/micro_step=1710/global_step=1710, RunningAvgSamplesPerSec=232.97332858256274, CurrSamplesPerSec=238.94232672738536, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:02,386] [INFO] [logging.py:96:log_dist] [Rank 0] step=1720, skipped=17, lr=[3.1797449803969354e-05, 3.1797449803969354e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:02,727] [INFO] [timer.py:199:stop] epoch=0/micro_step=1720/global_step=1720, RunningAvgSamplesPerSec=232.84284723246662, CurrSamplesPerSec=106.79577647461349, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:05,412] [INFO] [logging.py:96:log_dist] [Rank 0] step=1730, skipped=17, lr=[3.161425027681026e-05, 3.161425027681026e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:05,421] [INFO] [timer.py:199:stop] epoch=0/micro_step=1730/global_step=1730, RunningAvgSamplesPerSec=232.87184712601987, CurrSamplesPerSec=238.782915933541, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:07,287] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:18:07,288] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-04-14 08:18:07,817] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1738
[2023-04-14 08:18:07,817] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-04-14 08:18:07,817] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-04-14 08:18:08,078] [INFO] [logging.py:96:log_dist] [Rank 0] step=1740, skipped=18, lr=[3.14490431764453e-05, 3.14490431764453e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:08,087] [INFO] [timer.py:199:stop] epoch=0/micro_step=1740/global_step=1740, RunningAvgSamplesPerSec=232.91475439359183, CurrSamplesPerSec=238.69713476781678, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:10,765] [INFO] [logging.py:96:log_dist] [Rank 0] step=1750, skipped=18, lr=[3.126512556793569e-05, 3.126512556793569e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:10,774] [INFO] [timer.py:199:stop] epoch=0/micro_step=1750/global_step=1750, RunningAvgSamplesPerSec=232.94662151204616, CurrSamplesPerSec=238.66233029562125, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:13,454] [INFO] [logging.py:96:log_dist] [Rank 0] step=1760, skipped=18, lr=[3.108084544330228e-05, 3.108084544330228e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:13,463] [INFO] [timer.py:199:stop] epoch=0/micro_step=1760/global_step=1760, RunningAvgSamplesPerSec=232.97740049781007, CurrSamplesPerSec=237.82521092681992, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:16,173] [INFO] [logging.py:96:log_dist] [Rank 0] step=1770, skipped=18, lr=[3.089621346546249e-05, 3.089621346546249e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:16,181] [INFO] [timer.py:199:stop] epoch=0/micro_step=1770/global_step=1770, RunningAvgSamplesPerSec=232.9936571996118, CurrSamplesPerSec=233.80659501841726, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:18,874] [INFO] [logging.py:96:log_dist] [Rank 0] step=1780, skipped=18, lr=[3.071124031769283e-05, 3.071124031769283e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:18,883] [INFO] [timer.py:199:stop] epoch=0/micro_step=1780/global_step=1780, RunningAvgSamplesPerSec=233.01820229656133, CurrSamplesPerSec=237.68980441881718, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:19,115] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1780
[2023-04-14 08:18:19,115] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-04-14 08:18:19,115] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-04-14 08:18:21,528] [INFO] [logging.py:96:log_dist] [Rank 0] step=1790, skipped=19, lr=[3.0544481630056134e-05, 3.0544481630056134e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:21,537] [INFO] [timer.py:199:stop] epoch=0/micro_step=1790/global_step=1790, RunningAvgSamplesPerSec=233.06441668348802, CurrSamplesPerSec=238.14802795296933, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:24,248] [INFO] [logging.py:96:log_dist] [Rank 0] step=1800, skipped=19, lr=[3.035888976204979e-05, 3.035888976204979e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:24,257] [INFO] [timer.py:199:stop] epoch=0/micro_step=1800/global_step=1800, RunningAvgSamplesPerSec=233.0791939483406, CurrSamplesPerSec=235.97020346665118, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:26,947] [INFO] [logging.py:96:log_dist] [Rank 0] step=1810, skipped=19, lr=[3.0172987815031722e-05, 3.0172987815031722e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:26,956] [INFO] [timer.py:199:stop] epoch=0/micro_step=1810/global_step=1810, RunningAvgSamplesPerSec=233.10362508764408, CurrSamplesPerSec=238.4524455025139, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:29,642] [INFO] [logging.py:96:log_dist] [Rank 0] step=1820, skipped=19, lr=[2.998678654576212e-05, 2.998678654576212e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:29,651] [INFO] [timer.py:199:stop] epoch=0/micro_step=1820/global_step=1820, RunningAvgSamplesPerSec=233.12973451974648, CurrSamplesPerSec=238.99040425353854, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:32,340] [INFO] [logging.py:96:log_dist] [Rank 0] step=1830, skipped=19, lr=[2.9800296728320703e-05, 2.9800296728320703e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:32,349] [INFO] [timer.py:199:stop] epoch=0/micro_step=1830/global_step=1830, RunningAvgSamplesPerSec=233.15387881655056, CurrSamplesPerSec=235.1971583805082, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:33,660] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1834
[2023-04-14 08:18:33,660] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0
[2023-04-14 08:18:33,660] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
[2023-04-14 08:18:34,997] [INFO] [logging.py:96:log_dist] [Rank 0] step=1840, skipped=20, lr=[2.9632218102177862e-05, 2.9632218102177862e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:35,006] [INFO] [timer.py:199:stop] epoch=0/micro_step=1840/global_step=1840, RunningAvgSamplesPerSec=233.19704958366196, CurrSamplesPerSec=238.86238218628094, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:37,696] [INFO] [logging.py:96:log_dist] [Rank 0] step=1850, skipped=20, lr=[2.9445209785093726e-05, 2.9445209785093726e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:37,705] [INFO] [timer.py:199:stop] epoch=0/micro_step=1850/global_step=1850, RunningAvgSamplesPerSec=233.22017413644204, CurrSamplesPerSec=238.79226215998392, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:40,394] [INFO] [logging.py:96:log_dist] [Rank 0] step=1860, skipped=20, lr=[2.925794425684865e-05, 2.925794425684865e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:40,403] [INFO] [timer.py:199:stop] epoch=0/micro_step=1860/global_step=1860, RunningAvgSamplesPerSec=233.24358821482468, CurrSamplesPerSec=238.2228438311184, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:43,084] [INFO] [logging.py:96:log_dist] [Rank 0] step=1870, skipped=20, lr=[2.90704323531031e-05, 2.90704323531031e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:43,093] [INFO] [timer.py:199:stop] epoch=0/micro_step=1870/global_step=1870, RunningAvgSamplesPerSec=233.27042235508452, CurrSamplesPerSec=238.9525362654266, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:45,775] [INFO] [logging.py:96:log_dist] [Rank 0] step=1880, skipped=20, lr=[2.8882684923773458e-05, 2.8882684923773458e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:45,784] [INFO] [timer.py:199:stop] epoch=0/micro_step=1880/global_step=1880, RunningAvgSamplesPerSec=233.29686451508877, CurrSamplesPerSec=237.1832431496319, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:48,470] [INFO] [logging.py:96:log_dist] [Rank 0] step=1890, skipped=20, lr=[2.8694712832404198e-05, 2.8694712832404198e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:48,479] [INFO] [timer.py:199:stop] epoch=0/micro_step=1890/global_step=1890, RunningAvgSamplesPerSec=233.3208314051739, CurrSamplesPerSec=232.7857197366504, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:18:51,179] [INFO] [logging.py:96:log_dist] [Rank 0] step=1900, skipped=20, lr=[2.8506526955539338e-05, 2.8506526955539338e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:18:51,188] [INFO] [timer.py:199:stop] epoch=0/micro_step=1900/global_step=1900, RunningAvgSamplesPerSec=233.33989817103276, CurrSamplesPerSec=236.69302747181044, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
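timer.py prints two throughput figures every ten micro steps: CurrSamplesPerSec for the latest window and RunningAvgSamplesPerSec, a mean over the whole run, which is why isolated slow windows (75.6 at micro_step 1340, 106.8 at 1720) barely dent the average. Ten steps every ~2.7 s at ~238 samples/s also implies a global batch of 64 samples per step, i.e. 8 ranks at 8 samples each. A simplified sketch of that bookkeeping (hypothetical; an incremental mean stands in for whatever timer.py actually stores):

import time

# Sketch of the RunningAvgSamplesPerSec / CurrSamplesPerSec bookkeeping
# implied by the log. Hypothetical simplification, not DeepSpeed's timer.py.
class ThroughputTimer:
    def __init__(self, batch_size: int):
        self.batch_size = batch_size  # global samples processed per step
        self.count = 0                # completed (timed) steps
        self.avg = 0.0                # RunningAvgSamplesPerSec analogue
        self._t0 = 0.0

    def start(self):
        self._t0 = time.time()

    def stop(self):
        curr = self.batch_size / (time.time() - self._t0)  # CurrSamplesPerSec
        self.count += 1
        self.avg += (curr - self.avg) / self.count          # incremental mean
        return curr, self.avg

The incremental mean keeps the average cheap to maintain and robust to single-step hiccups, at the cost of reacting slowly to genuine throughput changes.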
epoch=0/micro_step=1900/global_step=1900, RunningAvgSamplesPerSec=233.33989817103276, CurrSamplesPerSec=236.69302747181044, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:18:53,918] [INFO] [logging.py:96:log_dist] [Rank 0] step=1910, skipped=20, lr=[2.8318138182093052e-05, 2.8318138182093052e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:18:53,927] [INFO] [timer.py:199:stop] epoch=0/micro_step=1910/global_step=1910, RunningAvgSamplesPerSec=233.34617106214046, CurrSamplesPerSec=237.44444454272772, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:18:56,610] [INFO] [logging.py:96:log_dist] [Rank 0] step=1920, skipped=20, lr=[2.8129557412719638e-05, 2.8129557412719638e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:18:56,619] [INFO] [timer.py:199:stop] epoch=0/micro_step=1920/global_step=1920, RunningAvgSamplesPerSec=233.37154141722866, CurrSamplesPerSec=238.22686070184218, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:18:59,305] [INFO] [logging.py:96:log_dist] [Rank 0] step=1930, skipped=20, lr=[2.7940795559182764e-05, 2.7940795559182764e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:18:59,313] [INFO] [timer.py:199:stop] epoch=0/micro_step=1930/global_step=1930, RunningAvgSamplesPerSec=233.39533276879624, CurrSamplesPerSec=238.19557497466621, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:19:00,909] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-14 08:19:00,909] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-14 08:19:00,909] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-04-14 08:19:00,909] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-04-14 08:19:00,909] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-14 08:19:00,910] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-04-14 08:19:00,910] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-14 08:19:00,910] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-04-14 08:19:00,910] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-14 08:19:00,910] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-04-14 08:19:00,910] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-14 08:19:00,910] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-04-14 08:19:00,910] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-14 08:19:00,910] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-04-14 08:19:00,910] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-14 08:19:00,910] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0 [2023-04-14 08:19:01,998] [INFO] [logging.py:96:log_dist] [Rank 0] step=1940, skipped=20, lr=[2.7751863543724076e-05, 2.7751863543724076e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:19:02,007] [INFO] [timer.py:199:stop] epoch=0/micro_step=1940/global_step=1940, RunningAvgSamplesPerSec=233.4195668534499, 
CurrSamplesPerSec=237.2119576503565, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:19:04,703] [INFO] [logging.py:96:log_dist] [Rank 0] step=1950, skipped=20, lr=[2.756277229843125e-05, 2.756277229843125e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:19:05,086] [INFO] [timer.py:199:stop] epoch=0/micro_step=1950/global_step=1950, RunningAvgSamplesPerSec=233.27519092141634, CurrSamplesPerSec=99.65565278081993, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:19:07,780] [INFO] [logging.py:96:log_dist] [Rank 0] step=1960, skipped=20, lr=[2.7373532764605368e-05, 2.7373532764605368e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:19:07,788] [INFO] [timer.py:199:stop] epoch=0/micro_step=1960/global_step=1960, RunningAvgSamplesPerSec=233.29938610800372, CurrSamplesPerSec=238.34404378416198, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:19:10,475] [INFO] [logging.py:96:log_dist] [Rank 0] step=1970, skipped=20, lr=[2.718415589212791e-05, 2.718415589212791e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:19:10,483] [INFO] [timer.py:199:stop] epoch=0/micro_step=1970/global_step=1970, RunningAvgSamplesPerSec=233.32278779471147, CurrSamplesPerSec=238.9978516183199, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:19:13,170] [INFO] [logging.py:96:log_dist] [Rank 0] step=1980, skipped=20, lr=[2.6994652638827078e-05, 2.6994652638827078e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:19:13,179] [INFO] [timer.py:199:stop] epoch=0/micro_step=1980/global_step=1980, RunningAvgSamplesPerSec=233.34576459013405, CurrSamplesPerSec=238.85664354642807, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:19:15,861] [INFO] [logging.py:96:log_dist] [Rank 0] step=1990, skipped=20, lr=[2.680503396984382e-05, 2.680503396984382e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:19:15,870] [INFO] [timer.py:199:stop] epoch=0/micro_step=1990/global_step=1990, RunningAvgSamplesPerSec=233.3702500204557, CurrSamplesPerSec=237.5694790092077, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:19:18,567] [INFO] [logging.py:96:log_dist] [Rank 0] step=2000, skipped=20, lr=[2.66153108569973e-05, 2.66153108569973e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:19:18,576] [INFO] [timer.py:199:stop] epoch=0/micro_step=2000/global_step=2000, RunningAvgSamplesPerSec=233.3878942544803, CurrSamplesPerSec=238.81478107384842, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:19:21,286] [INFO] [logging.py:96:log_dist] [Rank 0] step=2010, skipped=20, lr=[2.64254942781501e-05, 2.64254942781501e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:19:21,295] [INFO] [timer.py:199:stop] epoch=0/micro_step=2010/global_step=2010, RunningAvgSamplesPerSec=233.40239469352693, CurrSamplesPerSec=238.8279544827709, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:19:24,001] [INFO] [logging.py:96:log_dist] [Rank 0] step=2020, skipped=20, lr=[2.623559521657296e-05, 2.623559521657296e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:19:24,009] [INFO] [timer.py:199:stop] epoch=0/micro_step=2020/global_step=2020, RunningAvgSamplesPerSec=233.41708550701804, CurrSamplesPerSec=235.49490033143897, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:19:26,700] [INFO] [logging.py:96:log_dist] [Rank 0] step=2030, skipped=20, lr=[2.604562466030931e-05, 2.604562466030931e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:19:26,709] [INFO] [timer.py:199:stop] epoch=0/micro_step=2030/global_step=2030, RunningAvgSamplesPerSec=233.4373758437724, 
CurrSamplesPerSec=238.13070442426377, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:19:28,324] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:19:28,324] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-04-14 08:19:29,415] [INFO] [logging.py:96:log_dist] [Rank 0] step=2040, skipped=20, lr=[2.5855593601539412e-05, 2.5855593601539412e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:19:29,424] [INFO] [timer.py:199:stop] epoch=0/micro_step=2040/global_step=2040, RunningAvgSamplesPerSec=233.45087370673502, CurrSamplesPerSec=238.69331426884486, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:19:32,103] [INFO] [logging.py:96:log_dist] [Rank 0] step=2050, skipped=20, lr=[2.566551303594437e-05, 2.566551303594437e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:19:32,112] [INFO] [timer.py:199:stop] epoch=0/micro_step=2050/global_step=2050, RunningAvgSamplesPerSec=233.47532167146423, CurrSamplesPerSec=237.96034990798427, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:19:34,794] [INFO] [logging.py:96:log_dist] [Rank 0] step=2060, skipped=20, lr=[2.5475393962069882e-05, 2.5475393962069882e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:19:34,803] [INFO] [timer.py:199:stop] epoch=0/micro_step=2060/global_step=2060, RunningAvgSamplesPerSec=233.49834478807756, CurrSamplesPerSec=238.21248515805672, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:19:37,491] [INFO] [logging.py:96:log_dist] [Rank 0] step=2070, skipped=20, lr=[2.5285247380689836e-05, 2.5285247380689836e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:19:37,500] [INFO] [timer.py:199:stop] epoch=0/micro_step=2070/global_step=2070, RunningAvgSamplesPerSec=233.51873426746232, CurrSamplesPerSec=238.90447397228576, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:19:39,351] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2076
[2023-04-14 08:19:39,351] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-04-14 08:19:39,351] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-04-14 08:19:40,153] [INFO] [logging.py:96:log_dist] [Rank 0] step=2080, skipped=21, lr=[2.511410103196303e-05, 2.511410103196303e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:19:40,162] [INFO] [timer.py:199:stop] epoch=0/micro_step=2080/global_step=2080, RunningAvgSamplesPerSec=233.55337785369906, CurrSamplesPerSec=236.4126529235468, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:19:42,853] [INFO] [logging.py:96:log_dist] [Rank 0] step=2090, skipped=21, lr=[2.4923932498641955e-05, 2.4923932498641955e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:19:42,862] [INFO] [timer.py:199:stop] epoch=0/micro_step=2090/global_step=2090, RunningAvgSamplesPerSec=233.57210660940748, CurrSamplesPerSec=238.8033086436447, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:19:45,541] [INFO] [logging.py:96:log_dist] [Rank 0] step=2100, skipped=21, lr=[2.473376836678028e-05, 2.473376836678028e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:19:45,549] [INFO] [timer.py:199:stop] epoch=0/micro_step=2100/global_step=2100, RunningAvgSamplesPerSec=233.59568707197033, CurrSamplesPerSec=239.15648425820498, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:19:48,238] [INFO] [logging.py:96:log_dist] [Rank 0] step=2110, skipped=21, lr=[2.4543619639759023e-05, 2.4543619639759023e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:19:48,247] [INFO] [timer.py:199:stop] epoch=0/micro_step=2110/global_step=2110, RunningAvgSamplesPerSec=233.61493610653005, CurrSamplesPerSec=237.87663418892035, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:19:49,822] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2115
[2023-04-14 08:19:49,822] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0
[2023-04-14 08:19:49,823] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
[2023-04-14 08:19:50,890] [INFO] [logging.py:96:log_dist] [Rank 0] step=2120, skipped=22, lr=[2.4372508050164882e-05, 2.4372508050164882e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:19:50,899] [INFO] [timer.py:199:stop] epoch=0/micro_step=2120/global_step=2120, RunningAvgSamplesPerSec=233.6521244735603, CurrSamplesPerSec=238.61905109008902, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:19:53,579] [INFO] [logging.py:96:log_dist] [Rank 0] step=2130, skipped=22, lr=[2.418241890294983e-05, 2.418241890294983e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:19:53,588] [INFO] [timer.py:199:stop] epoch=0/micro_step=2130/global_step=2130, RunningAvgSamplesPerSec=233.67461882433165, CurrSamplesPerSec=237.63614779299348, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:19:56,284] [INFO] [logging.py:96:log_dist] [Rank 0] step=2140, skipped=22, lr=[2.399237706305959e-05, 2.399237706305959e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:19:56,292] [INFO] [timer.py:199:stop] epoch=0/micro_step=2140/global_step=2140, RunningAvgSamplesPerSec=233.69077787176153, CurrSamplesPerSec=237.70117010744727, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:19:58,989] [INFO] [logging.py:96:log_dist] [Rank 0] step=2150, skipped=22, lr=[2.380239352679908e-05, 2.380239352679908e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:19:58,998] [INFO] [timer.py:199:stop] epoch=0/micro_step=2150/global_step=2150, RunningAvgSamplesPerSec=233.70602508384144, CurrSamplesPerSec=239.0050866457075, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
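
The fused_optimizer messages above are FP16 dynamic loss scaling at work: after 100 consecutive overflow-free iterations the scale doubles, and on a gradient overflow the scale halves and that optimizer step is skipped, which is why the skipped counter in the step records moves from 20 to 21 after iteration 2076 and to 22 after iteration 2115. A minimal sketch of that policy, assuming the 2x factor and the 100-iteration window the messages show (the class and method names here are illustrative, not DeepSpeed's actual API):

class DynamicLossScaler:
    """Halve the loss scale on overflow; double it after a clean window of steps."""

    def __init__(self, init_scale=8192.0, scale_factor=2.0, scale_window=100):
        self.cur_scale = init_scale
        self.scale_factor = scale_factor  # 2x, matching the log messages above
        self.scale_window = scale_window  # "No Grad overflow for 100 iterations"
        self.clean_iters = 0

    def update_scale(self, overflow: bool) -> bool:
        """Update the scale after a step; return True if the step should be applied."""
        if overflow:
            # "Reducing dynamic loss scale ...": shrink the scale and skip this step.
            self.cur_scale = max(self.cur_scale / self.scale_factor, 1.0)
            self.clean_iters = 0
            return False
        self.clean_iters += 1
        if self.clean_iters >= self.scale_window:
            # "Increasing dynamic loss scale ...": a full clean window has passed.
            self.cur_scale *= self.scale_factor
            self.clean_iters = 0
        return True
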
[2023-04-14 08:20:01,734] [INFO] [logging.py:96:log_dist] [Rank 0] step=2160, skipped=22, lr=[2.3612479287099633e-05, 2.3612479287099633e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:01,743] [INFO] [timer.py:199:stop] epoch=0/micro_step=2160/global_step=2160, RunningAvgSamplesPerSec=233.70693227403802, CurrSamplesPerSec=238.60038647446052, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:04,421] [INFO] [logging.py:96:log_dist] [Rank 0] step=2170, skipped=22, lr=[2.3422645332882906e-05, 2.3422645332882906e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:04,430] [INFO] [timer.py:199:stop] epoch=0/micro_step=2170/global_step=2170, RunningAvgSamplesPerSec=233.72928604547945, CurrSamplesPerSec=238.57981636046748, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:07,111] [INFO] [logging.py:96:log_dist] [Rank 0] step=2180, skipped=22, lr=[2.323290264842504e-05, 2.323290264842504e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:07,309] [INFO] [timer.py:199:stop] epoch=0/micro_step=2180/global_step=2180, RunningAvgSamplesPerSec=233.67624436374007, CurrSamplesPerSec=140.0913375076847, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:09,999] [INFO] [logging.py:96:log_dist] [Rank 0] step=2190, skipped=22, lr=[2.3043262212721055e-05, 2.3043262212721055e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:10,008] [INFO] [timer.py:199:stop] epoch=0/micro_step=2190/global_step=2190, RunningAvgSamplesPerSec=233.69373668252177, CurrSamplesPerSec=238.79013795336556, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:12,692] [INFO] [logging.py:96:log_dist] [Rank 0] step=2200, skipped=22, lr=[2.2853734998849614e-05, 2.2853734998849614e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:12,701] [INFO] [timer.py:199:stop] epoch=0/micro_step=2200/global_step=2200, RunningAvgSamplesPerSec=233.71381283910793, CurrSamplesPerSec=237.49318182019258, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:15,384] [INFO] [logging.py:96:log_dist] [Rank 0] step=2210, skipped=22, lr=[2.2664331973338083e-05, 2.2664331973338083e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:15,393] [INFO] [timer.py:199:stop] epoch=0/micro_step=2210/global_step=2210, RunningAvgSamplesPerSec=233.73400356777714, CurrSamplesPerSec=238.8817255590153, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:17,269] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:20:17,269] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0
[2023-04-14 08:20:18,093] [INFO] [logging.py:96:log_dist] [Rank 0] step=2220, skipped=22, lr=[2.2475064095527948e-05, 2.2475064095527948e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:18,102] [INFO] [timer.py:199:stop] epoch=0/micro_step=2220/global_step=2220, RunningAvgSamplesPerSec=233.74717022433677, CurrSamplesPerSec=238.37388676604704, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:20,786] [INFO] [logging.py:96:log_dist] [Rank 0] step=2230, skipped=22, lr=[2.2285942316940733e-05, 2.2285942316940733e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:20,795] [INFO] [timer.py:199:stop] epoch=0/micro_step=2230/global_step=2230, RunningAvgSamplesPerSec=233.76656078123278, CurrSamplesPerSec=238.5041403967673, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:23,523] [INFO] [logging.py:96:log_dist] [Rank 0] step=2240, skipped=22, lr=[2.2096977580644263e-05, 2.2096977580644263e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:23,532] [INFO] [timer.py:199:stop] epoch=0/micro_step=2240/global_step=2240, RunningAvgSamplesPerSec=233.76914135938551, CurrSamplesPerSec=236.74291827362774, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:26,212] [INFO] [logging.py:96:log_dist] [Rank 0] step=2250, skipped=22, lr=[2.1908180820619516e-05, 2.1908180820619516e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:26,221] [INFO] [timer.py:199:stop] epoch=0/micro_step=2250/global_step=2250, RunningAvgSamplesPerSec=233.7894753305264, CurrSamplesPerSec=239.0165784719384, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:28,912] [INFO] [logging.py:96:log_dist] [Rank 0] step=2260, skipped=22, lr=[2.1719562961127923e-05, 2.1719562961127923e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:28,921] [INFO] [timer.py:199:stop] epoch=0/micro_step=2260/global_step=2260, RunningAvgSamplesPerSec=233.80594864140605, CurrSamplesPerSec=237.873261490446, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:31,611] [INFO] [logging.py:96:log_dist] [Rank 0] step=2270, skipped=22, lr=[2.1531134916079286e-05, 2.1531134916079286e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:31,620] [INFO] [timer.py:199:stop] epoch=0/micro_step=2270/global_step=2270, RunningAvgSamplesPerSec=233.82248902446733, CurrSamplesPerSec=236.28820889203038, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:34,364] [INFO] [logging.py:96:log_dist] [Rank 0] step=2280, skipped=22, lr=[2.1342907588400247e-05, 2.1342907588400247e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:34,373] [INFO] [timer.py:199:stop] epoch=0/micro_step=2280/global_step=2280, RunningAvgSamplesPerSec=233.8182228469893, CurrSamplesPerSec=237.9445300817361, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:37,065] [INFO] [logging.py:96:log_dist] [Rank 0] step=2290, skipped=22, lr=[2.1154891869403435e-05, 2.1154891869403435e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:37,073] [INFO] [timer.py:199:stop] epoch=0/micro_step=2290/global_step=2290, RunningAvgSamplesPerSec=233.83407002105722, CurrSamplesPerSec=238.31463142017293, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:39,756] [INFO] [logging.py:96:log_dist] [Rank 0] step=2300, skipped=22, lr=[2.096709863815726e-05, 2.096709863815726e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:39,765] [INFO] [timer.py:199:stop] epoch=0/micro_step=2300/global_step=2300, RunningAvgSamplesPerSec=233.8528295237981, CurrSamplesPerSec=237.66160536739721, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:42,445] [INFO] [logging.py:96:log_dist] [Rank 0] step=2310, skipped=22, lr=[2.0779538760856436e-05, 2.0779538760856436e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:42,454] [INFO] [timer.py:199:stop] epoch=0/micro_step=2310/global_step=2310, RunningAvgSamplesPerSec=233.8724452293622, CurrSamplesPerSec=238.4314773249106, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:44,317] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:20:44,317] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-04-14 08:20:45,134] [INFO] [logging.py:96:log_dist] [Rank 0] step=2320, skipped=22, lr=[2.0592223090193212e-05, 2.0592223090193212e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:45,143] [INFO] [timer.py:199:stop] epoch=0/micro_step=2320/global_step=2320, RunningAvgSamplesPerSec=233.89183664715273, CurrSamplesPerSec=238.71241798679253, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:47,824] [INFO] [logging.py:96:log_dist] [Rank 0] step=2330, skipped=22, lr=[2.0405162464729406e-05, 2.0405162464729406e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:47,833] [INFO] [timer.py:199:stop] epoch=0/micro_step=2330/global_step=2330, RunningAvgSamplesPerSec=233.91085101372175, CurrSamplesPerSec=238.63326360781127, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:50,513] [INFO] [logging.py:96:log_dist] [Rank 0] step=2340, skipped=22, lr=[2.02183677082693e-05, 2.02183677082693e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:50,521] [INFO] [timer.py:199:stop] epoch=0/micro_step=2340/global_step=2340, RunningAvgSamplesPerSec=233.92997407367153, CurrSamplesPerSec=238.66827180405468, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:51,831] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2344
[2023-04-14 08:20:51,832] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-04-14 08:20:51,832] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-04-14 08:20:53,169] [INFO] [logging.py:96:log_dist] [Rank 0] step=2350, skipped=23, lr=[2.0050488678940788e-05, 2.0050488678940788e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:53,178] [INFO] [timer.py:199:stop] epoch=0/micro_step=2350/global_step=2350, RunningAvgSamplesPerSec=233.96082321270197, CurrSamplesPerSec=237.46944090587402, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:55,857] [INFO] [logging.py:96:log_dist] [Rank 0] step=2360, skipped=23, lr=[1.986422883756718e-05, 1.986422883756718e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:55,866] [INFO] [timer.py:199:stop] epoch=0/micro_step=2360/global_step=2360, RunningAvgSamplesPerSec=233.97979733902068, CurrSamplesPerSec=238.92977869871677, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:20:58,543] [INFO] [logging.py:96:log_dist] [Rank 0] step=2370, skipped=23, lr=[1.9678266164994796e-05, 1.9678266164994796e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:20:58,552] [INFO] [timer.py:199:stop] epoch=0/micro_step=2370/global_step=2370, RunningAvgSamplesPerSec=233.99953509033898, CurrSamplesPerSec=238.7727208845165, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:01,233] [INFO] [logging.py:96:log_dist] [Rank 0] step=2380, skipped=23, lr=[1.9492611421497547e-05, 1.9492611421497547e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:01,377] [INFO] [timer.py:199:stop] epoch=0/micro_step=2380/global_step=2380, RunningAvgSamplesPerSec=233.96883390109102, CurrSamplesPerSec=158.31566829109485, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:04,085] [INFO] [logging.py:96:log_dist] [Rank 0] step=2390, skipped=23, lr=[1.9307275349531794e-05, 1.9307275349531794e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:04,094] [INFO] [timer.py:199:stop] epoch=0/micro_step=2390/global_step=2390, RunningAvgSamplesPerSec=233.9774694437971, CurrSamplesPerSec=238.43783092869216, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:06,775] [INFO] [logging.py:96:log_dist] [Rank 0] step=2400, skipped=23, lr=[1.912226867311475e-05, 1.912226867311475e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:06,784] [INFO] [timer.py:199:stop] epoch=0/micro_step=2400/global_step=2400, RunningAvgSamplesPerSec=233.9955782307424, CurrSamplesPerSec=236.20379477622117, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:09,505] [INFO] [logging.py:96:log_dist] [Rank 0] step=2410, skipped=23, lr=[1.8937602097203944e-05, 1.8937602097203944e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:09,514] [INFO] [timer.py:199:stop] epoch=0/micro_step=2410/global_step=2410, RunningAvgSamplesPerSec=233.99903089042954, CurrSamplesPerSec=238.58914669427915, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:12,197] [INFO] [logging.py:96:log_dist] [Rank 0] step=2420, skipped=23, lr=[1.875328630707785e-05, 1.875328630707785e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:12,206] [INFO] [timer.py:199:stop] epoch=0/micro_step=2420/global_step=2420, RunningAvgSamplesPerSec=234.0161216215927, CurrSamplesPerSec=237.7731780512261, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
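
The lr pairs in the step records trace a cosine decay. A rough reproduction, assuming a 5e-5 peak with no warmup and roughly 4,170 total scheduler steps; that total is back-solved from the logged values, so treat it as an estimate rather than the run's recorded configuration:

import math

def cosine_lr(step: int, peak_lr: float = 5e-5, total_steps: int = 4170) -> float:
    # Anneal from peak_lr at step 0 down to 0 at total_steps.
    progress = min(step / total_steps, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(2040))  # ~2.585e-05, vs. the 2.5856e-05 logged at step=2040
print(cosine_lr(2420))  # ~1.876e-05, vs. the 1.8753e-05 logged at step=2420
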
[2023-04-14 08:21:14,892] [INFO] [logging.py:96:log_dist] [Rank 0] step=2430, skipped=23, lr=[1.8569331967717568e-05, 1.8569331967717568e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:15,140] [INFO] [timer.py:199:stop] epoch=0/micro_step=2430/global_step=2430, RunningAvgSamplesPerSec=233.94774985393286, CurrSamplesPerSec=125.78309125004569, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:17,820] [INFO] [logging.py:96:log_dist] [Rank 0] step=2440, skipped=23, lr=[1.8385749723189744e-05, 1.8385749723189744e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:17,829] [INFO] [timer.py:199:stop] epoch=0/micro_step=2440/global_step=2440, RunningAvgSamplesPerSec=233.96604633452827, CurrSamplesPerSec=238.64471970505844, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:19,421] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:21:19,421] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-04-14 08:21:20,510] [INFO] [logging.py:96:log_dist] [Rank 0] step=2450, skipped=23, lr=[1.8202550196030655e-05, 1.8202550196030655e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:20,519] [INFO] [timer.py:199:stop] epoch=0/micro_step=2450/global_step=2450, RunningAvgSamplesPerSec=233.9838597367916, CurrSamplesPerSec=238.48549375257645, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:23,196] [INFO] [logging.py:96:log_dist] [Rank 0] step=2460, skipped=23, lr=[1.8019743986631587e-05, 1.8019743986631587e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:23,205] [INFO] [timer.py:199:stop] epoch=0/micro_step=2460/global_step=2460, RunningAvgSamplesPerSec=234.0026974455502, CurrSamplesPerSec=238.5425023215722, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:25,884] [INFO] [logging.py:96:log_dist] [Rank 0] step=2470, skipped=23, lr=[1.7837341672625463e-05, 1.7837341672625463e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:26,617] [INFO] [timer.py:199:stop] epoch=0/micro_step=2470/global_step=2470, RunningAvgSamplesPerSec=233.76995778984585, CurrSamplesPerSec=64.50488252362443, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:29,320] [INFO] [logging.py:96:log_dist] [Rank 0] step=2480, skipped=23, lr=[1.7655353808274795e-05, 1.7655353808274795e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:29,329] [INFO] [timer.py:199:stop] epoch=0/micro_step=2480/global_step=2480, RunningAvgSamplesPerSec=233.78071952421772, CurrSamplesPerSec=238.81456861141976, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:32,011] [INFO] [logging.py:96:log_dist] [Rank 0] step=2490, skipped=23, lr=[1.7473790923860938e-05, 1.7473790923860938e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:32,020] [INFO] [timer.py:199:stop] epoch=0/micro_step=2490/global_step=2490, RunningAvgSamplesPerSec=233.79851563528365, CurrSamplesPerSec=239.07660930120298, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:34,701] [INFO] [logging.py:96:log_dist] [Rank 0] step=2500, skipped=23, lr=[1.7292663525074884e-05, 1.7292663525074884e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:34,710] [INFO] [timer.py:199:stop] epoch=0/micro_step=2500/global_step=2500, RunningAvgSamplesPerSec=233.81638652744434, CurrSamplesPerSec=238.67251591977254, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:37,398] [INFO] [logging.py:96:log_dist] [Rank 0] step=2510, skipped=23, lr=[1.7111982092409288e-05, 1.7111982092409288e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:37,919] [INFO] [timer.py:199:stop] epoch=0/micro_step=2510/global_step=2510, RunningAvgSamplesPerSec=233.65798828538777, CurrSamplesPerSec=82.01886918129459, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:40,652] [INFO] [logging.py:96:log_dist] [Rank 0] step=2520, skipped=23, lr=[1.693175708055207e-05, 1.693175708055207e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:40,661] [INFO] [timer.py:199:stop] epoch=0/micro_step=2520/global_step=2520, RunningAvgSamplesPerSec=233.6589217803339, CurrSamplesPerSec=230.92890669089206, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:43,338] [INFO] [logging.py:96:log_dist] [Rank 0] step=2530, skipped=23, lr=[1.67519989177815e-05, 1.67519989177815e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:43,347] [INFO] [timer.py:199:stop] epoch=0/micro_step=2530/global_step=2530, RunningAvgSamplesPerSec=233.67837204111154, CurrSamplesPerSec=238.80458330367946, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:46,027] [INFO] [logging.py:96:log_dist] [Rank 0] step=2540, skipped=23, lr=[1.657271800536272e-05, 1.657271800536272e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:46,036] [INFO] [timer.py:199:stop] epoch=0/micro_step=2540/global_step=2540, RunningAvgSamplesPerSec=233.69685761000807, CurrSamplesPerSec=239.05127346560752, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:47,636] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:21:47,636] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-04-14 08:21:48,723] [INFO] [logging.py:96:log_dist] [Rank 0] step=2550, skipped=23, lr=[1.6393924716946e-05, 1.6393924716946e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:48,732] [INFO] [timer.py:199:stop] epoch=0/micro_step=2550/global_step=2550, RunningAvgSamplesPerSec=233.71267092258412, CurrSamplesPerSec=238.3008799278445, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:49,243] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2551
[2023-04-14 08:21:49,243] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-04-14 08:21:49,243] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-04-14 08:21:51,397] [INFO] [logging.py:96:log_dist] [Rank 0] step=2560, skipped=24, lr=[1.623343622698747e-05, 1.623343622698747e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:51,539] [INFO] [timer.py:199:stop] epoch=0/micro_step=2560/global_step=2560, RunningAvgSamplesPerSec=233.69152000283157, CurrSamplesPerSec=159.00577415632634, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:54,229] [INFO] [logging.py:96:log_dist] [Rank 0] step=2570, skipped=24, lr=[1.6055597902046095e-05, 1.6055597902046095e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:54,238] [INFO] [timer.py:199:stop] epoch=0/micro_step=2570/global_step=2570, RunningAvgSamplesPerSec=233.70665348118925, CurrSamplesPerSec=238.46388424480716, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:56,918] [INFO] [logging.py:96:log_dist] [Rank 0] step=2580, skipped=24, lr=[1.5878277122993152e-05, 1.5878277122993152e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:56,926] [INFO] [timer.py:199:stop] epoch=0/micro_step=2580/global_step=2580, RunningAvgSamplesPerSec=233.724922990624, CurrSamplesPerSec=238.15647936944777, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:21:59,609] [INFO] [logging.py:96:log_dist] [Rank 0] step=2590, skipped=24, lr=[1.5701484150060596e-05, 1.5701484150060596e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:21:59,618] [INFO] [timer.py:199:stop] epoch=0/micro_step=2590/global_step=2590, RunningAvgSamplesPerSec=233.74205062640777, CurrSamplesPerSec=238.65363076340404, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:02,305] [INFO] [logging.py:96:log_dist] [Rank 0] step=2600, skipped=24, lr=[1.5525229212940167e-05, 1.5525229212940167e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:03,064] [INFO] [timer.py:199:stop] epoch=0/micro_step=2600/global_step=2600, RunningAvgSamplesPerSec=233.5113053639106, CurrSamplesPerSec=62.82032460525426, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:05,751] [INFO] [logging.py:96:log_dist] [Rank 0] step=2610, skipped=24, lr=[1.5349522510191484e-05, 1.5349522510191484e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:05,760] [INFO] [timer.py:199:stop] epoch=0/micro_step=2610/global_step=2610, RunningAvgSamplesPerSec=233.52774739240925, CurrSamplesPerSec=239.04020403002033, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:08,450] [INFO] [logging.py:96:log_dist] [Rank 0] step=2620, skipped=24, lr=[1.5174374208651912e-05, 1.5174374208651912e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:08,459] [INFO] [timer.py:199:stop] epoch=0/micro_step=2620/global_step=2620, RunningAvgSamplesPerSec=233.5430545913528, CurrSamplesPerSec=238.5772718506227, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:11,140] [INFO] [logging.py:96:log_dist] [Rank 0] step=2630, skipped=24, lr=[1.4999794442848319e-05, 1.4999794442848319e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
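
The two throughput fields differ in scope: CurrSamplesPerSec is the rate of the most recent step, while RunningAvgSamplesPerSec aggregates over the whole run, which is why a single stalled step (CurrSamplesPerSec=62.8 at step 2600 above) barely dents the average. A small sketch of that arithmetic, assuming 8 GPUs at a per-device batch of 8, i.e. 64 samples per micro step; both topology numbers are assumptions here, not read from this excerpt:

class ThroughputMeter:
    """Per-step rate plus a cumulative average, mirroring the two logged fields."""

    def __init__(self):
        self.samples, self.seconds = 0, 0.0

    def update(self, n_samples: int, step_time_s: float) -> tuple:
        self.samples += n_samples
        self.seconds += step_time_s
        curr = n_samples / step_time_s         # CurrSamplesPerSec analogue
        running = self.samples / self.seconds  # RunningAvgSamplesPerSec analogue
        return curr, running

meter = ThroughputMeter()
# ~0.269 s per step is what the ~2.69 s gaps between 10-step records imply.
print(meter.update(64, 0.269))  # ~238 samples/s, the typical rate logged above
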
[2023-04-14 08:22:11,149] [INFO] [timer.py:199:stop] epoch=0/micro_step=2630/global_step=2630, RunningAvgSamplesPerSec=233.56137144531783, CurrSamplesPerSec=238.8766237059963, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:13,857] [INFO] [logging.py:96:log_dist] [Rank 0] step=2640, skipped=24, lr=[1.4825793314410646e-05, 1.4825793314410646e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:14,309] [INFO] [timer.py:199:stop] epoch=0/micro_step=2640/global_step=2640, RunningAvgSamplesPerSec=233.42750178395175, CurrSamplesPerSec=90.01083944140294, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:16,991] [INFO] [logging.py:96:log_dist] [Rank 0] step=2650, skipped=24, lr=[1.4652380891487399e-05, 1.4652380891487399e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:17,000] [INFO] [timer.py:199:stop] epoch=0/micro_step=2650/global_step=2650, RunningAvgSamplesPerSec=233.4457066771361, CurrSamplesPerSec=238.7648627950478, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:17,785] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:22:17,785] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-04-14 08:22:18,854] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2656
[2023-04-14 08:22:18,854] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-04-14 08:22:18,854] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-04-14 08:22:19,655] [INFO] [logging.py:96:log_dist] [Rank 0] step=2660, skipped=25, lr=[1.4496821347955359e-05, 1.4496821347955359e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:19,664] [INFO] [timer.py:199:stop] epoch=0/micro_step=2660/global_step=2660, RunningAvgSamplesPerSec=233.47214560421045, CurrSamplesPerSec=235.91897903815072, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:22,352] [INFO] [logging.py:96:log_dist] [Rank 0] step=2670, skipped=25, lr=[1.4324555080790523e-05, 1.4324555080790523e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:22,361] [INFO] [timer.py:199:stop] epoch=0/micro_step=2670/global_step=2670, RunningAvgSamplesPerSec=233.48791589938836, CurrSamplesPerSec=238.61692996398108, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:25,040] [INFO] [logging.py:96:log_dist] [Rank 0] step=2680, skipped=25, lr=[1.4152906522061048e-05, 1.4152906522061048e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:25,746] [INFO] [timer.py:199:stop] epoch=0/micro_step=2680/global_step=2680, RunningAvgSamplesPerSec=233.28504832240588, CurrSamplesPerSec=66.36550398981014, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:28,425] [INFO] [logging.py:96:log_dist] [Rank 0] step=2690, skipped=25, lr=[1.398188560378977e-05, 1.398188560378977e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:28,434] [INFO] [timer.py:199:stop] epoch=0/micro_step=2690/global_step=2690, RunningAvgSamplesPerSec=233.30410854369867, CurrSamplesPerSec=238.0407471222068, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:31,115] [INFO] [logging.py:96:log_dist] [Rank 0] step=2700, skipped=25, lr=[1.3811502221682643e-05, 1.3811502221682643e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:31,124] [INFO] [timer.py:199:stop] epoch=0/micro_step=2700/global_step=2700, RunningAvgSamplesPerSec=233.32266659056413, CurrSamplesPerSec=238.52830061259198, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:33,821] [INFO] [logging.py:96:log_dist] [Rank 0] step=2710, skipped=25, lr=[1.3641766234556146e-05, 1.3641766234556146e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:33,829] [INFO] [timer.py:199:stop] epoch=0/micro_step=2710/global_step=2710, RunningAvgSamplesPerSec=233.33621267508317, CurrSamplesPerSec=239.0461643679471, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:36,508] [INFO] [logging.py:96:log_dist] [Rank 0] step=2720, skipped=25, lr=[1.347268746376685e-05, 1.347268746376685e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:37,008] [INFO] [timer.py:199:stop] epoch=0/micro_step=2720/global_step=2720, RunningAvgSamplesPerSec=233.20157003054155, CurrSamplesPerSec=84.28130665238302, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:39,688] [INFO] [logging.py:96:log_dist] [Rank 0] step=2730, skipped=25, lr=[1.3304275692643118e-05, 1.3304275692643118e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:39,697] [INFO] [timer.py:199:stop] epoch=0/micro_step=2730/global_step=2730, RunningAvgSamplesPerSec=233.2204742079708, CurrSamplesPerSec=238.2272835382196, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:42,391] [INFO] [logging.py:96:log_dist] [Rank 0] step=2740, skipped=25, lr=[1.3136540665918978e-05, 1.3136540665918978e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:42,400] [INFO] [timer.py:199:stop] epoch=0/micro_step=2740/global_step=2740, RunningAvgSamplesPerSec=233.23495103001966, CurrSamplesPerSec=238.48655314287134, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:45,076] [INFO] [logging.py:96:log_dist] [Rank 0] step=2750, skipped=25, lr=[1.2969492089170343e-05, 1.2969492089170343e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:45,085] [INFO] [timer.py:199:stop] epoch=0/micro_step=2750/global_step=2750, RunningAvgSamplesPerSec=233.2548804918099, CurrSamplesPerSec=239.39344073699183, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:47,287] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:22:47,288] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-04-14 08:22:47,836] [INFO] [logging.py:96:log_dist] [Rank 0] step=2760, skipped=25, lr=[1.2803139628253364e-05, 1.2803139628253364e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:47,845] [INFO] [timer.py:199:stop] epoch=0/micro_step=2760/global_step=2760, RunningAvgSamplesPerSec=233.2515990739609, CurrSamplesPerSec=239.74900772740102, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:48,623] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2762
[2023-04-14 08:22:48,623] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-04-14 08:22:48,623] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-04-14 08:22:50,496] [INFO] [logging.py:96:log_dist] [Rank 0] step=2770, skipped=26, lr=[1.26540255488449e-05, 1.26540255488449e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:50,505] [INFO] [timer.py:199:stop] epoch=0/micro_step=2770/global_step=2770, RunningAvgSamplesPerSec=233.27903349010398, CurrSamplesPerSec=239.4335842415712, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:53,186] [INFO] [logging.py:96:log_dist] [Rank 0] step=2780, skipped=26, lr=[1.2489022192733554e-05, 1.2489022192733554e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:53,195] [INFO] [timer.py:199:stop] epoch=0/micro_step=2780/global_step=2780, RunningAvgSamplesPerSec=233.29709893753486, CurrSamplesPerSec=238.44630292715468, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:55,872] [INFO] [logging.py:96:log_dist] [Rank 0] step=2790, skipped=26, lr=[1.2324742753665924e-05, 1.2324742753665924e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:55,881] [INFO] [timer.py:199:stop] epoch=0/micro_step=2790/global_step=2790, RunningAvgSamplesPerSec=233.3159465513026, CurrSamplesPerSec=238.28501479320144, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:22:58,566] [INFO] [logging.py:96:log_dist] [Rank 0] step=2800, skipped=26, lr=[1.21611967372688e-05, 1.21611967372688e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:22:58,631] [INFO] [timer.py:199:stop] epoch=0/micro_step=2800/global_step=2800, RunningAvgSamplesPerSec=233.31562794176475, CurrSamplesPerSec=197.69910988232417, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:23:01,307] [INFO] [logging.py:96:log_dist] [Rank 0] step=2810, skipped=26, lr=[1.199839360673127e-05, 1.199839360673127e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:23:01,316] [INFO] [timer.py:199:stop] epoch=0/micro_step=2810/global_step=2810, RunningAvgSamplesPerSec=233.33493181558367, CurrSamplesPerSec=238.99125535745128, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:23:03,996] [INFO] [logging.py:96:log_dist] [Rank 0] step=2820, skipped=26, lr=[1.1836342782257165e-05, 1.1836342782257165e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:23:04,005] [INFO] [timer.py:199:stop] epoch=0/micro_step=2820/global_step=2820, RunningAvgSamplesPerSec=233.35266601552877, CurrSamplesPerSec=238.57260705687037, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:23:06,695] [INFO] [logging.py:96:log_dist] [Rank 0] step=2830, skipped=26, lr=[1.1675053640519953e-05, 1.1675053640519953e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:23:06,704] [INFO] [timer.py:199:stop] epoch=0/micro_step=2830/global_step=2830, RunningAvgSamplesPerSec=233.36748208144414, CurrSamplesPerSec=239.3434936859647, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:23:09,387] [INFO] [logging.py:96:log_dist] [Rank 0] step=2840, skipped=26, lr=[1.1514535514120217e-05, 1.1514535514120217e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:23:09,396] [INFO] [timer.py:199:stop] epoch=0/micro_step=2840/global_step=2840, RunningAvgSamplesPerSec=233.38418421024835, CurrSamplesPerSec=233.04237789897158, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:23:12,074] [INFO] [logging.py:96:log_dist] [Rank 0] step=2850, skipped=26, lr=[1.1354797691045615e-05, 1.1354797691045615e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
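
The memory columns tell the usual story: MemAllocated=4.32GB is what stays resident between steps (weights, gradients, optimizer state), while MaxMemAllocated=21.24GB is the high-water mark hit inside forward/backward when activations peak. The same counters can be read with standard PyTorch calls; this standalone helper is illustrative, not the logger DeepSpeed itself uses:

import torch

def log_gpu_mem(tag: str) -> None:
    gb = 1024 ** 3
    # memory_allocated(): tensors live right now; max_memory_allocated(): peak since reset.
    print(f"{tag}: MemAllocated={torch.cuda.memory_allocated() / gb:.2f}GB, "
          f"MaxMemAllocated={torch.cuda.max_memory_allocated() / gb:.2f}GB")

# Call torch.cuda.reset_peak_memory_stats() to restart the high-water mark, e.g. once per step.
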
MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:23:14,777] [INFO] [logging.py:96:log_dist] [Rank 0] step=2860, skipped=26, lr=[1.1195849414133481e-05, 1.1195849414133481e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:23:14,786] [INFO] [timer.py:199:stop] epoch=0/micro_step=2860/global_step=2860, RunningAvgSamplesPerSec=233.4154722998436, CurrSamplesPerSec=231.28585878951114, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:23:15,846] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-14 08:23:15,846] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-04-14 08:23:15,846] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-14 08:23:15,846] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-04-14 08:23:15,846] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-14 08:23:15,846] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-14 08:23:15,846] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-04-14 08:23:15,846] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-14 08:23:15,846] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-04-14 08:23:15,846] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-14 08:23:15,846] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-04-14 08:23:15,846] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-14 08:23:15,847] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-04-14 08:23:15,847] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations [2023-04-14 08:23:15,847] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-04-14 08:23:15,847] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0 [2023-04-14 08:23:16,110] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2864 [2023-04-14 08:23:16,110] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-14 08:23:16,110] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2864 [2023-04-14 08:23:16,110] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-14 08:23:16,110] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2864 [2023-04-14 08:23:16,110] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-14 08:23:16,110] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2864 [2023-04-14 08:23:16,110] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2864 [2023-04-14 08:23:16,110] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-14 08:23:16,110] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-14 08:23:16,110] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 
2864 [2023-04-14 08:23:16,110] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-14 08:23:16,110] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2864 [2023-04-14 08:23:16,110] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2864 [2023-04-14 08:23:16,110] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-14 08:23:16,111] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0 [2023-04-14 08:23:16,111] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0 [2023-04-14 08:23:17,444] [INFO] [logging.py:96:log_dist] [Rank 0] step=2870, skipped=27, lr=[1.1053478629280279e-05, 1.1053478629280279e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:23:17,453] [INFO] [timer.py:199:stop] epoch=0/micro_step=2870/global_step=2870, RunningAvgSamplesPerSec=233.43935199902282, CurrSamplesPerSec=238.8645076784262, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:23:20,151] [INFO] [logging.py:96:log_dist] [Rank 0] step=2880, skipped=27, lr=[1.0896055790043382e-05, 1.0896055790043382e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:23:20,160] [INFO] [timer.py:199:stop] epoch=0/micro_step=2880/global_step=2880, RunningAvgSamplesPerSec=233.4513121639778, CurrSamplesPerSec=238.56433809806506, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:23:22,840] [INFO] [logging.py:96:log_dist] [Rank 0] step=2890, skipped=27, lr=[1.073944904094385e-05, 1.073944904094385e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:23:22,849] [INFO] [timer.py:199:stop] epoch=0/micro_step=2890/global_step=2890, RunningAvgSamplesPerSec=233.4680242942644, CurrSamplesPerSec=238.89214251896468, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:23:25,549] [INFO] [logging.py:96:log_dist] [Rank 0] step=2900, skipped=27, lr=[1.0583667443647067e-05, 1.0583667443647067e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:23:25,557] [INFO] [timer.py:199:stop] epoch=0/micro_step=2900/global_step=2900, RunningAvgSamplesPerSec=233.47928115466695, CurrSamplesPerSec=222.86751761367316, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:23:28,244] [INFO] [logging.py:96:log_dist] [Rank 0] step=2910, skipped=27, lr=[1.0428720012073062e-05, 1.0428720012073062e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:23:28,253] [INFO] [timer.py:199:stop] epoch=0/micro_step=2910/global_step=2910, RunningAvgSamplesPerSec=233.4940961279786, CurrSamplesPerSec=239.0480802539773, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:23:30,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=2920, skipped=27, lr=[1.0274615711874925e-05, 1.0274615711874925e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:23:31,113] [INFO] [timer.py:199:stop] epoch=0/micro_step=2920/global_step=2920, RunningAvgSamplesPerSec=233.46116868707892, CurrSamplesPerSec=144.55095286559722, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:23:33,793] [INFO] [logging.py:96:log_dist] [Rank 0] step=2930, skipped=27, lr=[1.0121363459919986e-05, 1.0121363459919986e-05], mom=[(0.9, 0.95), (0.9, 0.95)] [2023-04-14 08:23:33,802] [INFO] [timer.py:199:stop] epoch=0/micro_step=2930/global_step=2930, RunningAvgSamplesPerSec=233.47784874218516, CurrSamplesPerSec=235.9407518871009, MemAllocated=4.32GB, MaxMemAllocated=21.24GB [2023-04-14 08:23:36,529] [INFO] 
[2023-04-14 08:23:36,529] [INFO] [logging.py:96:log_dist] [Rank 0] step=2940, skipped=27, lr=[9.968972123773915e-06, 9.968972123773915e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:23:36,538] [INFO] [timer.py:199:stop] epoch=0/micro_step=2940/global_step=2940, RunningAvgSamplesPerSec=233.48100455632417, CurrSamplesPerSec=221.92616721135013, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:23:39,225] [INFO] [logging.py:96:log_dist] [Rank 0] step=2950, skipped=27, lr=[9.817450521187574e-06, 9.817450521187574e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:23:39,234] [INFO] [timer.py:199:stop] epoch=0/micro_step=2950/global_step=2950, RunningAvgSamplesPerSec=233.4955609905403, CurrSamplesPerSec=236.0554885764762, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:23:41,928] [INFO] [logging.py:96:log_dist] [Rank 0] step=2960, skipped=27, lr=[9.666807419586848e-06, 9.666807419586848e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:23:41,937] [INFO] [timer.py:199:stop] epoch=0/micro_step=2960/global_step=2960, RunningAvgSamplesPerSec=233.50802805574187, CurrSamplesPerSec=232.2142178684443, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:23:43,529] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:23:43,529] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-04-14 08:23:44,062] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 2967
[2023-04-14 08:23:44,062] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-04-14 08:23:44,063] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-04-14 08:23:44,591] [INFO] [logging.py:96:log_dist] [Rank 0] step=2970, skipped=28, lr=[9.531986951811952e-06, 9.531986951811952e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:23:44,600] [INFO] [timer.py:199:stop] epoch=0/micro_step=2970/global_step=2970, RunningAvgSamplesPerSec=233.53163853622172, CurrSamplesPerSec=239.03105123261932, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:23:47,295] [INFO] [logging.py:96:log_dist] [Rank 0] step=2980, skipped=28, lr=[9.383036973879814e-06, 9.383036973879814e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:23:47,304] [INFO] [timer.py:199:stop] epoch=0/micro_step=2980/global_step=2980, RunningAvgSamplesPerSec=233.5438418149766, CurrSamplesPerSec=234.8496128642794, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:23:49,990] [INFO] [logging.py:96:log_dist] [Rank 0] step=2990, skipped=28, lr=[9.234990633207672e-06, 9.234990633207672e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:23:49,999] [INFO] [timer.py:199:stop] epoch=0/micro_step=2990/global_step=2990, RunningAvgSamplesPerSec=233.5584709405588, CurrSamplesPerSec=238.42597114218844, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:23:52,715] [INFO] [logging.py:96:log_dist] [Rank 0] step=3000, skipped=28, lr=[9.087856496133895e-06, 9.087856496133895e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:23:52,724] [INFO] [timer.py:199:stop] epoch=0/micro_step=3000/global_step=3000, RunningAvgSamplesPerSec=233.5642739120391, CurrSamplesPerSec=238.57684777092538, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:23:55,411] [INFO] [logging.py:96:log_dist] [Rank 0] step=3010, skipped=28, lr=[8.941643076214437e-06, 8.941643076214437e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:23:55,419] [INFO] [timer.py:199:stop] epoch=0/micro_step=3010/global_step=3010, RunningAvgSamplesPerSec=233.5783019778221, CurrSamplesPerSec=238.7797298871995, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:23:58,103] [INFO] [logging.py:96:log_dist] [Rank 0] step=3020, skipped=28, lr=[8.79635883373021e-06, 8.79635883373021e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:23:58,112] [INFO] [timer.py:199:stop] epoch=0/micro_step=3020/global_step=3020, RunningAvgSamplesPerSec=233.59310302780844, CurrSamplesPerSec=235.43479606970544, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:00,795] [INFO] [logging.py:96:log_dist] [Rank 0] step=3030, skipped=28, lr=[8.652012175197541e-06, 8.652012175197541e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:00,803] [INFO] [timer.py:199:stop] epoch=0/micro_step=3030/global_step=3030, RunningAvgSamplesPerSec=233.60818664064428, CurrSamplesPerSec=238.56009782871445, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:03,490] [INFO] [logging.py:96:log_dist] [Rank 0] step=3040, skipped=28, lr=[8.50861145288178e-06, 8.50861145288178e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:03,532] [INFO] [timer.py:199:stop] epoch=0/micro_step=3040/global_step=3040, RunningAvgSamplesPerSec=233.61273594845596, CurrSamplesPerSec=209.76010177178551, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:06,222] [INFO] [logging.py:96:log_dist] [Rank 0] step=3050, skipped=28, lr=[8.36616496431398e-06, 8.36616496431398e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:06,231] [INFO] [timer.py:199:stop] epoch=0/micro_step=3050/global_step=3050, RunningAvgSamplesPerSec=233.62569613665633, CurrSamplesPerSec=239.0951355376347, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:08,911] [INFO] [logging.py:96:log_dist] [Rank 0] step=3060, skipped=28, lr=[8.224680951810821e-06, 8.224680951810821e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:08,920] [INFO] [timer.py:199:stop] epoch=0/micro_step=3060/global_step=3060, RunningAvgSamplesPerSec=233.64117310591178, CurrSamplesPerSec=238.82327988099559, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:11,321] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:24:11,321] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-04-14 08:24:11,583] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3069
[2023-04-14 08:24:11,583] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-04-14 08:24:11,584] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-04-14 08:24:11,584] [INFO] [logging.py:96:log_dist] [Rank 0] step=3070, skipped=29, lr=[8.098175024997445e-06, 8.098175024997445e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:11,584] [INFO] [timer.py:199:stop] epoch=0/micro_step=3070/global_step=3070, RunningAvgSamplesPerSec=233.66342688327882, CurrSamplesPerSec=264.14676905758114, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:14,266] [INFO] [logging.py:96:log_dist] [Rank 0] step=3080, skipped=29, lr=[7.958542224759266e-06, 7.958542224759266e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:14,273] [INFO] [timer.py:199:stop] epoch=0/micro_step=3080/global_step=3080, RunningAvgSamplesPerSec=233.67876896844277, CurrSamplesPerSec=235.3072536766285, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:16,955] [INFO] [logging.py:96:log_dist] [Rank 0] step=3090, skipped=29, lr=[7.819895486675718e-06, 7.819895486675718e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:16,964] [INFO] [timer.py:199:stop] epoch=0/micro_step=3090/global_step=3090, RunningAvgSamplesPerSec=233.69344906108284, CurrSamplesPerSec=239.47203212997582, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:19,649] [INFO] [logging.py:96:log_dist] [Rank 0] step=3100, skipped=29, lr=[7.682242833200234e-06, 7.682242833200234e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:19,658] [INFO] [timer.py:199:stop] epoch=0/micro_step=3100/global_step=3100, RunningAvgSamplesPerSec=233.70732878355057, CurrSamplesPerSec=238.461342207235, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:22,387] [INFO] [logging.py:96:log_dist] [Rank 0] step=3110, skipped=29, lr=[7.545592229265961e-06, 7.545592229265961e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:22,396] [INFO] [timer.py:199:stop] epoch=0/micro_step=3110/global_step=3110, RunningAvgSamplesPerSec=233.70912649701066, CurrSamplesPerSec=220.0597263553118, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:25,072] [INFO] [logging.py:96:log_dist] [Rank 0] step=3120, skipped=29, lr=[7.409951581824914e-06, 7.409951581824914e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:25,080] [INFO] [timer.py:199:stop] epoch=0/micro_step=3120/global_step=3120, RunningAvgSamplesPerSec=233.72526350674838, CurrSamplesPerSec=238.904686594714, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:27,757] [INFO] [logging.py:96:log_dist] [Rank 0] step=3130, skipped=29, lr=[7.275328739390466e-06, 7.275328739390466e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:27,766] [INFO] [timer.py:199:stop] epoch=0/micro_step=3130/global_step=3130, RunningAvgSamplesPerSec=233.74103394060236, CurrSamplesPerSec=238.5425023215722, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:30,445] [INFO] [logging.py:96:log_dist] [Rank 0] step=3140, skipped=29, lr=[7.141731491583187e-06, 7.141731491583187e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:30,454] [INFO] [timer.py:199:stop] epoch=0/micro_step=3140/global_step=3140, RunningAvgSamplesPerSec=233.7561588683288, CurrSamplesPerSec=236.16493466731887, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:33,130] [INFO] [logging.py:96:log_dist] [Rank 0] step=3150, skipped=29, lr=[7.0091675686801275e-06, 7.0091675686801275e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:33,139] [INFO] [timer.py:199:stop] epoch=0/micro_step=3150/global_step=3150, RunningAvgSamplesPerSec=233.77181300494328, CurrSamplesPerSec=238.74872347673264, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:35,816] [INFO] [logging.py:96:log_dist] [Rank 0] step=3160, skipped=29, lr=[6.877644641167535e-06, 6.877644641167535e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:35,825] [INFO] [timer.py:199:stop] epoch=0/micro_step=3160/global_step=3160, RunningAvgSamplesPerSec=233.7872437728171, CurrSamplesPerSec=237.68475335052275, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:38,508] [INFO] [logging.py:96:log_dist] [Rank 0] step=3170, skipped=29, lr=[6.747170319297011e-06, 6.747170319297011e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:38,893] [INFO] [timer.py:199:stop] epoch=0/micro_step=3170/global_step=3170, RunningAvgSamplesPerSec=233.70028074594615, CurrSamplesPerSec=99.62776501048106, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:39,140] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:24:39,141] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-04-14 08:24:41,595] [INFO] [logging.py:96:log_dist] [Rank 0] step=3180, skipped=29, lr=[6.617752152645165e-06, 6.617752152645165e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:41,604] [INFO] [timer.py:199:stop] epoch=0/micro_step=3180/global_step=3180, RunningAvgSamplesPerSec=233.70903214582432, CurrSamplesPerSec=238.93722228551897, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:42,932] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3184
[2023-04-14 08:24:42,932] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-04-14 08:24:42,932] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-04-14 08:24:44,266] [INFO] [logging.py:96:log_dist] [Rank 0] step=3190, skipped=30, lr=[6.502185005856312e-06, 6.502185005856312e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:44,275] [INFO] [timer.py:199:stop] epoch=0/micro_step=3190/global_step=3190, RunningAvgSamplesPerSec=233.72860203341153, CurrSamplesPerSec=238.97125601800417, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:46,955] [INFO] [logging.py:96:log_dist] [Rank 0] step=3200, skipped=30, lr=[6.374794113982233e-06, 6.374794113982233e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:46,964] [INFO] [timer.py:199:stop] epoch=0/micro_step=3200/global_step=3200, RunningAvgSamplesPerSec=233.74307149289163, CurrSamplesPerSec=238.88767799631393, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:49,647] [INFO] [logging.py:96:log_dist] [Rank 0] step=3210, skipped=30, lr=[6.248480923962577e-06, 6.248480923962577e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:49,656] [INFO] [timer.py:199:stop] epoch=0/micro_step=3210/global_step=3210, RunningAvgSamplesPerSec=233.75673936515108, CurrSamplesPerSec=238.8596191030422, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:52,335] [INFO] [logging.py:96:log_dist] [Rank 0] step=3220, skipped=30, lr=[6.123252744600266e-06, 6.123252744600266e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:52,344] [INFO] [timer.py:199:stop] epoch=0/micro_step=3220/global_step=3220, RunningAvgSamplesPerSec=233.77132590868462, CurrSamplesPerSec=237.52260192700595, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:55,032] [INFO] [logging.py:96:log_dist] [Rank 0] step=3230, skipped=30, lr=[5.999116821916728e-06, 5.999116821916728e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:55,041] [INFO] [timer.py:199:stop] epoch=0/micro_step=3230/global_step=3230, RunningAvgSamplesPerSec=233.78354730715353, CurrSamplesPerSec=238.74617536276355, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:24:57,758] [INFO] [logging.py:96:log_dist] [Rank 0] step=3240, skipped=30, lr=[5.876080338732643e-06, 5.876080338732643e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:24:57,767] [INFO] [timer.py:199:stop] epoch=0/micro_step=3240/global_step=3240, RunningAvgSamplesPerSec=233.7882184184676, CurrSamplesPerSec=238.76188959877467, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:00,459] [INFO] [logging.py:96:log_dist] [Rank 0] step=3250, skipped=30, lr=[5.75415041425234e-06, 5.75415041425234e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:00,468] [INFO] [timer.py:199:stop] epoch=0/micro_step=3250/global_step=3250, RunningAvgSamplesPerSec=233.7991676686254, CurrSamplesPerSec=238.34404378416198, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:03,152] [INFO] [logging.py:96:log_dist] [Rank 0] step=3260, skipped=30, lr=[5.63333410365183e-06, 5.63333410365183e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:03,161] [INFO] [timer.py:199:stop] epoch=0/micro_step=3260/global_step=3260, RunningAvgSamplesPerSec=233.81210631311066, CurrSamplesPerSec=239.71539432171022, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:05,849] [INFO] [logging.py:96:log_dist] [Rank 0] step=3270, skipped=30, lr=[5.5136383976705675e-06, 5.5136383976705675e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:05,858] [INFO] [timer.py:199:stop] epoch=0/micro_step=3270/global_step=3270, RunningAvgSamplesPerSec=233.82421823985302, CurrSamplesPerSec=234.46806711679056, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:08,535] [INFO] [logging.py:96:log_dist] [Rank 0] step=3280, skipped=30, lr=[5.395070222207e-06, 5.395070222207e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:08,544] [INFO] [timer.py:199:stop] epoch=0/micro_step=3280/global_step=3280, RunningAvgSamplesPerSec=233.83874918670622, CurrSamplesPerSec=239.2411033939408, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:10,135] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:25:10,135] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-04-14 08:25:10,665] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3287
[2023-04-14 08:25:10,665] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-04-14 08:25:10,665] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-04-14 08:25:11,193] [INFO] [logging.py:96:log_dist] [Rank 0] step=3290, skipped=31, lr=[5.289328574569599e-06, 5.289328574569599e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:11,202] [INFO] [timer.py:199:stop] epoch=0/micro_step=3290/global_step=3290, RunningAvgSamplesPerSec=233.86065096749093, CurrSamplesPerSec=238.6599962125287, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:13,887] [INFO] [logging.py:96:log_dist] [Rank 0] step=3300, skipped=31, lr=[5.172921553956417e-06, 5.172921553956417e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:13,896] [INFO] [timer.py:199:stop] epoch=0/micro_step=3300/global_step=3300, RunningAvgSamplesPerSec=233.8731509004929, CurrSamplesPerSec=239.07937740695482, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:16,583] [INFO] [logging.py:96:log_dist] [Rank 0] step=3310, skipped=31, lr=[5.0576617786053684e-06, 5.0576617786053684e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:16,591] [INFO] [timer.py:199:stop] epoch=0/micro_step=3310/global_step=3310, RunningAvgSamplesPerSec=233.88514634956002, CurrSamplesPerSec=238.70456386593406, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:19,273] [INFO] [logging.py:96:log_dist] [Rank 0] step=3320, skipped=31, lr=[4.9435559177406495e-06, 4.9435559177406495e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:19,567] [INFO] [timer.py:199:stop] epoch=0/micro_step=3320/global_step=3320, RunningAvgSamplesPerSec=233.82481290170895, CurrSamplesPerSec=115.62624231180588, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:22,251] [INFO] [logging.py:96:log_dist] [Rank 0] step=3330, skipped=31, lr=[4.8306105738180145e-06, 4.8306105738180145e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:22,260] [INFO] [timer.py:199:stop] epoch=0/micro_step=3330/global_step=3330, RunningAvgSamplesPerSec=233.8375557806944, CurrSamplesPerSec=239.07724808914872, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:24,940] [INFO] [logging.py:96:log_dist] [Rank 0] step=3340, skipped=31, lr=[4.71883228214276e-06, 4.71883228214276e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:24,949] [INFO] [timer.py:199:stop] epoch=0/micro_step=3340/global_step=3340, RunningAvgSamplesPerSec=233.85130552355142, CurrSamplesPerSec=238.58469348938243, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:27,690] [INFO] [logging.py:96:log_dist] [Rank 0] step=3350, skipped=31, lr=[4.608227510491561e-06, 4.608227510491561e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:27,699] [INFO] [timer.py:199:stop] epoch=0/micro_step=3350/global_step=3350, RunningAvgSamplesPerSec=233.85079103976076, CurrSamplesPerSec=233.74226092256785, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:30,382] [INFO] [logging.py:96:log_dist] [Rank 0] step=3360, skipped=31, lr=[4.498802658738235e-06, 4.498802658738235e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:30,391] [INFO] [timer.py:199:stop] epoch=0/micro_step=3360/global_step=3360, RunningAvgSamplesPerSec=233.8635460249464, CurrSamplesPerSec=238.8103194422327, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:33,082] [INFO] [logging.py:96:log_dist] [Rank 0] step=3370, skipped=31, lr=[4.390564058483429e-06, 4.390564058483429e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:33,091] [INFO] [timer.py:199:stop] epoch=0/micro_step=3370/global_step=3370, RunningAvgSamplesPerSec=233.874024140837, CurrSamplesPerSec=237.22872845758948, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:35,772] [INFO] [logging.py:96:log_dist] [Rank 0] step=3380, skipped=31, lr=[4.283517972688261e-06, 4.283517972688261e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:35,781] [INFO] [timer.py:199:stop] epoch=0/micro_step=3380/global_step=3380, RunningAvgSamplesPerSec=233.88702384695074, CurrSamplesPerSec=239.3155409742529, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:38,191] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:25:38,191] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-04-14 08:25:38,471] [INFO] [logging.py:96:log_dist] [Rank 0] step=3390, skipped=31, lr=[4.1776705953119275e-06, 4.1776705953119275e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:38,480] [INFO] [timer.py:199:stop] epoch=0/micro_step=3390/global_step=3390, RunningAvgSamplesPerSec=233.89784660411493, CurrSamplesPerSec=239.14561812847836, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:38,990] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3391
[2023-04-14 08:25:38,990] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-04-14 08:25:38,990] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-04-14 08:25:41,134] [INFO] [logging.py:96:log_dist] [Rank 0] step=3400, skipped=32, lr=[4.083437914791161e-06, 4.083437914791161e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:41,143] [INFO] [timer.py:199:stop] epoch=0/micro_step=3400/global_step=3400, RunningAvgSamplesPerSec=233.91756409777585, CurrSamplesPerSec=239.52609365268307, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:43,845] [INFO] [logging.py:96:log_dist] [Rank 0] step=3410, skipped=32, lr=[3.979884899068523e-06, 3.979884899068523e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:44,324] [INFO] [timer.py:199:stop] epoch=0/micro_step=3410/global_step=3410, RunningAvgSamplesPerSec=233.80753150209478, CurrSamplesPerSec=86.66062834215967, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:47,026] [INFO] [logging.py:96:log_dist] [Rank 0] step=3420, skipped=32, lr=[3.877548160747768e-06, 3.877548160747768e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:47,034] [INFO] [timer.py:199:stop] epoch=0/micro_step=3420/global_step=3420, RunningAvgSamplesPerSec=233.81702475023076, CurrSamplesPerSec=237.14741217472735, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:49,717] [INFO] [logging.py:96:log_dist] [Rank 0] step=3430, skipped=32, lr=[3.776433621293232e-06, 3.776433621293232e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:49,726] [INFO] [timer.py:199:stop] epoch=0/micro_step=3430/global_step=3430, RunningAvgSamplesPerSec=233.82961789685334, CurrSamplesPerSec=239.20081125057698, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:52,408] [INFO] [logging.py:96:log_dist] [Rank 0] step=3440, skipped=32, lr=[3.676547131449731e-06, 3.676547131449731e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:52,417] [INFO] [timer.py:199:stop] epoch=0/micro_step=3440/global_step=3440, RunningAvgSamplesPerSec=233.84232417245383, CurrSamplesPerSec=236.08351538561664, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:55,107] [INFO] [logging.py:96:log_dist] [Rank 0] step=3450, skipped=32, lr=[3.5778944709040003e-06, 3.5778944709040003e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:55,740] [INFO] [timer.py:199:stop] epoch=0/micro_step=3450/global_step=3450, RunningAvgSamplesPerSec=233.69850393603835, CurrSamplesPerSec=71.73124324769209, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:25:58,425] [INFO] [logging.py:96:log_dist] [Rank 0] step=3460, skipped=32, lr=[3.4804813479502623e-06, 3.4804813479502623e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:25:58,434] [INFO] [timer.py:199:stop] epoch=0/micro_step=3460/global_step=3460, RunningAvgSamplesPerSec=233.7107994318959, CurrSamplesPerSec=237.78918574162, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:01,136] [INFO] [logging.py:96:log_dist] [Rank 0] step=3470, skipped=32, lr=[3.384313399159944e-06, 3.384313399159944e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:01,145] [INFO] [timer.py:199:stop] epoch=0/micro_step=3470/global_step=3470, RunningAvgSamplesPerSec=233.71889031223856, CurrSamplesPerSec=238.29939909255785, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:03,822] [INFO] [logging.py:96:log_dist] [Rank 0] step=3480, skipped=32, lr=[3.2893961890555157e-06, 3.2893961890555157e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:03,831] [INFO] [timer.py:199:stop] epoch=0/micro_step=3480/global_step=3480, RunningAvgSamplesPerSec=233.73305227057276, CurrSamplesPerSec=239.0112580446404, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:06,517] [INFO] [logging.py:96:log_dist] [Rank 0] step=3490, skipped=32, lr=[3.195735209788528e-06, 3.195735209788528e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:06,811] [INFO] [timer.py:199:stop] epoch=0/micro_step=3490/global_step=3490, RunningAvgSamplesPerSec=233.67516294251055, CurrSamplesPerSec=115.67088482140818, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:07,596] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:26:07,596] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-04-14 08:26:08,126] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3494
[2023-04-14 08:26:08,126] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-04-14 08:26:08,127] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-04-14 08:26:09,464] [INFO] [logging.py:96:log_dist] [Rank 0] step=3500, skipped=33, lr=[3.112518886507282e-06, 3.112518886507282e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:09,473] [INFO] [timer.py:199:stop] epoch=0/micro_step=3500/global_step=3500, RunningAvgSamplesPerSec=233.6951625000216, CurrSamplesPerSec=235.79215435846038, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:12,148] [INFO] [logging.py:96:log_dist] [Rank 0] step=3510, skipped=33, lr=[3.021259616121544e-06, 3.021259616121544e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:12,157] [INFO] [timer.py:199:stop] epoch=0/micro_step=3510/global_step=3510, RunningAvgSamplesPerSec=233.7095923679182, CurrSamplesPerSec=238.29897600014203, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:14,837] [INFO] [logging.py:96:log_dist] [Rank 0] step=3520, skipped=33, lr=[2.9312720916383514e-06, 2.9312720916383514e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:14,846] [INFO] [timer.py:199:stop] epoch=0/micro_step=3520/global_step=3520, RunningAvgSamplesPerSec=233.72288114172304, CurrSamplesPerSec=239.49596149299407, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:17,526] [INFO] [logging.py:96:log_dist] [Rank 0] step=3530, skipped=33, lr=[2.842561519965084e-06, 2.842561519965084e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:17,535] [INFO] [timer.py:199:stop] epoch=0/micro_step=3530/global_step=3530, RunningAvgSamplesPerSec=233.7358790022462, CurrSamplesPerSec=238.5692145941704, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:20,214] [INFO] [logging.py:96:log_dist] [Rank 0] step=3540, skipped=33, lr=[2.7551330341213793e-06, 2.7551330341213793e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:20,223] [INFO] [timer.py:199:stop] epoch=0/micro_step=3540/global_step=3540, RunningAvgSamplesPerSec=233.7493872647217, CurrSamplesPerSec=239.44319505620464, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:22,899] [INFO] [logging.py:96:log_dist] [Rank 0] step=3550, skipped=33, lr=[2.668991692942133e-06, 2.668991692942133e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:22,908] [INFO] [timer.py:199:stop] epoch=0/micro_step=3550/global_step=3550, RunningAvgSamplesPerSec=233.7632288136767, CurrSamplesPerSec=238.85175529292846, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:25,586] [INFO] [logging.py:96:log_dist] [Rank 0] step=3560, skipped=33, lr=[2.5841424807847543e-06, 2.5841424807847543e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:25,595] [INFO] [timer.py:199:stop] epoch=0/micro_step=3560/global_step=3560, RunningAvgSamplesPerSec=233.77662075736808, CurrSamplesPerSec=239.36526291478776, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:28,274] [INFO] [logging.py:96:log_dist] [Rank 0] step=3570, skipped=33, lr=[2.5005903072407826e-06, 2.5005903072407826e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:28,983] [INFO] [timer.py:199:stop] epoch=0/micro_step=3570/global_step=3570, RunningAvgSamplesPerSec=233.62231220198169, CurrSamplesPerSec=66.06667866079562, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:31,664] [INFO] [logging.py:96:log_dist] [Rank 0] step=3580, skipped=33, lr=[2.418340006851813e-06, 2.418340006851813e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:31,673] [INFO] [timer.py:199:stop] epoch=0/micro_step=3580/global_step=3580, RunningAvgSamplesPerSec=233.6354465125821, CurrSamplesPerSec=238.57748389103676, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:34,411] [INFO] [logging.py:96:log_dist] [Rank 0] step=3590, skipped=33, lr=[2.337396338829731e-06, 2.337396338829731e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:34,420] [INFO] [timer.py:199:stop] epoch=0/micro_step=3590/global_step=3590, RunningAvgSamplesPerSec=233.6348671009257, CurrSamplesPerSec=239.43315711216104, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:36,010] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:26:36,010] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 16384.0 to 32768.0
[2023-04-14 08:26:36,271] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3596
[2023-04-14 08:26:36,271] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 32768.0 to 16384.0
[2023-04-14 08:26:36,271] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2023-04-14 08:26:37,070] [INFO] [logging.py:96:log_dist] [Rank 0] step=3600, skipped=34, lr=[2.265668080851138e-06, 2.265668080851138e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:37,079] [INFO] [timer.py:199:stop] epoch=0/micro_step=3600/global_step=3600, RunningAvgSamplesPerSec=233.65522403203192, CurrSamplesPerSec=238.76762368646231, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:39,764] [INFO] [logging.py:96:log_dist] [Rank 0] step=3610, skipped=34, lr=[2.187219854956843e-06, 2.187219854956843e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:39,773] [INFO] [timer.py:199:stop] epoch=0/micro_step=3610/global_step=3610, RunningAvgSamplesPerSec=233.66886067314644, CurrSamplesPerSec=238.98934038217217, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:41,887] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3617
[2023-04-14 08:26:41,887] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-04-14 08:26:41,888] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-04-14 08:26:42,416] [INFO] [logging.py:96:log_dist] [Rank 0] step=3620, skipped=35, lr=[2.1177449286216565e-06, 2.1177449286216565e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:42,425] [INFO] [timer.py:199:stop] epoch=0/micro_step=3620/global_step=3620, RunningAvgSamplesPerSec=233.69078445939869, CurrSamplesPerSec=238.99274480388536, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:45,100] [INFO] [logging.py:96:log_dist] [Rank 0] step=3630, skipped=35, lr=[2.041808531212086e-06, 2.041808531212086e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:45,108] [INFO] [timer.py:199:stop] epoch=0/micro_step=3630/global_step=3630, RunningAvgSamplesPerSec=233.70496532054352, CurrSamplesPerSec=239.122610759987, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:47,801] [INFO] [logging.py:96:log_dist] [Rank 0] step=3640, skipped=35, lr=[1.9672005532396758e-06, 1.9672005532396758e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:47,810] [INFO] [timer.py:199:stop] epoch=0/micro_step=3640/global_step=3640, RunningAvgSamplesPerSec=233.71479526619677, CurrSamplesPerSec=236.77486777993784, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:50,514] [INFO] [logging.py:96:log_dist] [Rank 0] step=3650, skipped=35, lr=[1.8939253117120748e-06, 1.8939253117120748e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:51,171] [INFO] [timer.py:199:stop] epoch=0/micro_step=3650/global_step=3650, RunningAvgSamplesPerSec=233.5705362891832, CurrSamplesPerSec=69.90472078393363, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:53,849] [INFO] [logging.py:96:log_dist] [Rank 0] step=3660, skipped=35, lr=[1.8219870465214033e-06, 1.8219870465214033e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:53,858] [INFO] [timer.py:199:stop] epoch=0/micro_step=3660/global_step=3660, RunningAvgSamplesPerSec=233.5842654096895, CurrSamplesPerSec=238.64853864857776, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
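The cluster at 08:26:41 above shows the scale being cut twice in quick succession: the overflow at iteration 3596 dropped it from 32768.0 to 16384.0, and a second overflow at 3617, arriving before another 100 clean iterations could accrue, halved it again to 8192.0. This policy is driven by the fp16 section of the DeepSpeed config; the sketch below uses real DeepSpeed option names, but apart from the window of 100 evident in the log, the values are assumptions, not values read from this run's config:

# Hypothetical fp16 config consistent with this log; keys are real
# DeepSpeed options, most values are assumed for illustration.
ds_fp16_config = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 selects dynamic loss scaling
        "loss_scale_window": 100,   # matches "No Grad overflow for 100 iterations"
        "initial_scale_power": 16,  # assumed: starting scale of 2**16
        "hysteresis": 2,            # assumed
        "min_loss_scale": 1,        # assumed
    }
}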
[2023-04-14 08:26:56,534] [INFO] [logging.py:96:log_dist] [Rank 0] step=3670, skipped=35, lr=[1.7513899201989148e-06, 1.7513899201989148e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:56,543] [INFO] [timer.py:199:stop] epoch=0/micro_step=3670/global_step=3670, RunningAvgSamplesPerSec=233.59827962030147, CurrSamplesPerSec=238.66254248725716, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:26:59,219] [INFO] [logging.py:96:log_dist] [Rank 0] step=3680, skipped=35, lr=[1.682138017674173e-06, 1.682138017674173e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:26:59,228] [INFO] [timer.py:199:stop] epoch=0/micro_step=3680/global_step=3680, RunningAvgSamplesPerSec=233.61226987120983, CurrSamplesPerSec=238.71666367272746, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:01,906] [INFO] [logging.py:96:log_dist] [Rank 0] step=3690, skipped=35, lr=[1.6142353460386533e-06, 1.6142353460386533e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:01,915] [INFO] [timer.py:199:stop] epoch=0/micro_step=3690/global_step=3690, RunningAvgSamplesPerSec=233.62567852670512, CurrSamplesPerSec=239.122610759987, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:04,594] [INFO] [logging.py:96:log_dist] [Rank 0] step=3700, skipped=35, lr=[1.5476858343138972e-06, 1.5476858343138972e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:04,603] [INFO] [timer.py:199:stop] epoch=0/micro_step=3700/global_step=3700, RunningAvgSamplesPerSec=233.63870307783156, CurrSamplesPerSec=238.13260566403963, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:07,291] [INFO] [logging.py:96:log_dist] [Rank 0] step=3710, skipped=35, lr=[1.4824933332241692e-06, 1.4824933332241692e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:07,299] [INFO] [timer.py:199:stop] epoch=0/micro_step=3710/global_step=3710, RunningAvgSamplesPerSec=233.64994670632805, CurrSamplesPerSec=238.8538805959525, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:09,769] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:27:09,769] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-04-14 08:27:10,052] [INFO] [logging.py:96:log_dist] [Rank 0] step=3720, skipped=35, lr=[1.4186616149736349e-06, 1.4186616149736349e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:10,061] [INFO] [timer.py:199:stop] epoch=0/micro_step=3720/global_step=3720, RunningAvgSamplesPerSec=233.64608009477917, CurrSamplesPerSec=236.23227259926782, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:12,744] [INFO] [logging.py:96:log_dist] [Rank 0] step=3730, skipped=35, lr=[1.3561943730281052e-06, 1.3561943730281052e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:12,753] [INFO] [timer.py:199:stop] epoch=0/micro_step=3730/global_step=3730, RunningAvgSamplesPerSec=233.65812970916414, CurrSamplesPerSec=236.9870389449643, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:15,429] [INFO] [logging.py:96:log_dist] [Rank 0] step=3740, skipped=35, lr=[1.295095221901313e-06, 1.295095221901313e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:16,341] [INFO] [timer.py:199:stop] epoch=0/micro_step=3740/global_step=3740, RunningAvgSamplesPerSec=233.46586845236598, CurrSamplesPerSec=54.68232036714613, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:17,111] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3742
[2023-04-14 08:27:17,111] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 16384.0 to 8192.0
[2023-04-14 08:27:17,111] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-04-14 08:27:18,983] [INFO] [logging.py:96:log_dist] [Rank 0] step=3750, skipped=36, lr=[1.2412786271450622e-06, 1.2412786271450622e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:18,992] [INFO] [timer.py:199:stop] epoch=0/micro_step=3750/global_step=3750, RunningAvgSamplesPerSec=233.48758698659873, CurrSamplesPerSec=238.55776574482135, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:21,672] [INFO] [logging.py:96:log_dist] [Rank 0] step=3760, skipped=36, lr=[1.1827885228783863e-06, 1.1827885228783863e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:21,681] [INFO] [timer.py:199:stop] epoch=0/micro_step=3760/global_step=3760, RunningAvgSamplesPerSec=233.50062507972723, CurrSamplesPerSec=237.10258632035183, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:24,361] [INFO] [logging.py:96:log_dist] [Rank 0] step=3770, skipped=36, lr=[1.1256765431346406e-06, 1.1256765431346406e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:24,370] [INFO] [timer.py:199:stop] epoch=0/micro_step=3770/global_step=3770, RunningAvgSamplesPerSec=233.51362738210406, CurrSamplesPerSec=237.17360334331735, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:27,048] [INFO] [logging.py:96:log_dist] [Rank 0] step=3780, skipped=36, lr=[1.0699459925584409e-06, 1.0699459925584409e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:27,057] [INFO] [timer.py:199:stop] epoch=0/micro_step=3780/global_step=3780, RunningAvgSamplesPerSec=233.526947418061, CurrSamplesPerSec=239.0427584241647, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:29,734] [INFO] [logging.py:96:log_dist] [Rank 0] step=3790, skipped=36, lr=[1.0156000958614049e-06, 1.0156000958614049e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:29,743] [INFO] [timer.py:199:stop] epoch=0/micro_step=3790/global_step=3790, RunningAvgSamplesPerSec=233.54026165894058, CurrSamplesPerSec=238.7699598752227, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:32,422] [INFO] [logging.py:96:log_dist] [Rank 0] step=3800, skipped=36, lr=[9.62641997635541e-07, 9.62641997635541e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:32,431] [INFO] [timer.py:199:stop] epoch=0/micro_step=3800/global_step=3800, RunningAvgSamplesPerSec=233.55333833199126, CurrSamplesPerSec=238.0297710729167, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:35,116] [INFO] [logging.py:96:log_dist] [Rank 0] step=3810, skipped=36, lr=[9.110747621713101e-07, 9.110747621713101e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:35,125] [INFO] [timer.py:199:stop] epoch=0/micro_step=3810/global_step=3810, RunningAvgSamplesPerSec=233.56484685636897, CurrSamplesPerSec=238.72876470624854, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
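The lr column decays smoothly toward zero, consistent with cosine annealing and no warmup. Below is a hedged sketch of the standard cosine-with-warmup formula (in the style of Hugging Face's get_cosine_schedule_with_warmup); base_lr and num_training_steps are illustrative assumptions inferred from the trace, not values read from any config:

```python
import math

def cosine_lr(step: int, base_lr: float = 5e-5,
              num_warmup_steps: int = 0, num_training_steps: int = 4130) -> float:
    """Cosine-annealed learning rate; constants are illustrative."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)  # linear warmup
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # decay to ~0
```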
[2023-04-14 08:27:37,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=3820, skipped=36, lr=[8.609013732803039e-07, 8.609013732803039e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:37,991] [INFO] [timer.py:199:stop] epoch=0/micro_step=3820/global_step=3820, RunningAvgSamplesPerSec=233.53802422268942, CurrSamplesPerSec=143.64070167027148, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:40,652] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 3829
[2023-04-14 08:27:40,652] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 8192.0 to 4096.0
[2023-04-14 08:27:40,652] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
[2023-04-14 08:27:40,653] [INFO] [logging.py:96:log_dist] [Rank 0] step=3830, skipped=37, lr=[8.169394632294991e-07, 8.169394632294991e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:40,653] [INFO] [timer.py:199:stop] epoch=0/micro_step=3830/global_step=3830, RunningAvgSamplesPerSec=233.5566418251891, CurrSamplesPerSec=275.16065763321336, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:43,394] [INFO] [logging.py:96:log_dist] [Rank 0] step=3840, skipped=37, lr=[7.694223142397289e-07, 7.694223142397289e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:43,402] [INFO] [timer.py:199:stop] epoch=0/micro_step=3840/global_step=3840, RunningAvgSamplesPerSec=233.55582444116885, CurrSamplesPerSec=237.29080106890513, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:46,081] [INFO] [logging.py:96:log_dist] [Rank 0] step=3850, skipped=37, lr=[7.233072081946401e-07, 7.233072081946401e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:46,090] [INFO] [timer.py:199:stop] epoch=0/micro_step=3850/global_step=3850, RunningAvgSamplesPerSec=233.56869203642034, CurrSamplesPerSec=238.948920287591, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:48,767] [INFO] [logging.py:96:log_dist] [Rank 0] step=3860, skipped=37, lr=[6.785968134317283e-07, 6.785968134317283e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:48,776] [INFO] [timer.py:199:stop] epoch=0/micro_step=3860/global_step=3860, RunningAvgSamplesPerSec=233.58171765820816, CurrSamplesPerSec=239.16820374011803, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:51,457] [INFO] [logging.py:96:log_dist] [Rank 0] step=3870, skipped=37, lr=[6.352937170083229e-07, 6.352937170083229e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:51,465] [INFO] [timer.py:199:stop] epoch=0/micro_step=3870/global_step=3870, RunningAvgSamplesPerSec=233.59395455586565, CurrSamplesPerSec=238.9633848556038, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:54,192] [INFO] [logging.py:96:log_dist] [Rank 0] step=3880, skipped=37, lr=[5.934004245518793e-07, 5.934004245518793e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:54,201] [INFO] [timer.py:199:stop] epoch=0/micro_step=3880/global_step=3880, RunningAvgSamplesPerSec=233.59642586506257, CurrSamplesPerSec=239.32578243131437, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:56,878] [INFO] [logging.py:96:log_dist] [Rank 0] step=3890, skipped=37, lr=[5.529193601150118e-07, 5.529193601150118e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:56,886] [INFO] [timer.py:199:stop] epoch=0/micro_step=3890/global_step=3890, RunningAvgSamplesPerSec=233.6093650000684, CurrSamplesPerSec=238.9621084997494, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:27:59,562] [INFO] [logging.py:96:log_dist] [Rank 0] step=3900, skipped=37, lr=[5.13852866035222e-07, 5.13852866035222e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:27:59,783] [INFO] [timer.py:199:stop] epoch=0/micro_step=3900/global_step=3900, RunningAvgSamplesPerSec=233.57609260733065, CurrSamplesPerSec=133.407610286535, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:02,459] [INFO] [logging.py:96:log_dist] [Rank 0] step=3910, skipped=37, lr=[4.7620320279936904e-07, 4.7620320279936904e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:02,468] [INFO] [timer.py:199:stop] epoch=0/micro_step=3910/global_step=3910, RunningAvgSamplesPerSec=233.58922695039007, CurrSamplesPerSec=239.02083498432413, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:05,144] [INFO] [logging.py:96:log_dist] [Rank 0] step=3920, skipped=37, lr=[4.399725489128648e-07, 4.399725489128648e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:05,153] [INFO] [timer.py:199:stop] epoch=0/micro_step=3920/global_step=3920, RunningAvgSamplesPerSec=233.60231838737667, CurrSamplesPerSec=238.8130813850041, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:07,840] [INFO] [logging.py:96:log_dist] [Rank 0] step=3930, skipped=37, lr=[4.0516300077363123e-07, 4.0516300077363123e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:07,849] [INFO] [timer.py:199:stop] epoch=0/micro_step=3930/global_step=3930, RunningAvgSamplesPerSec=233.61293772097804, CurrSamplesPerSec=238.8130813850041, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:08,097] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:28:08,097] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 4096.0 to 8192.0
[2023-04-14 08:28:10,531] [INFO] [logging.py:96:log_dist] [Rank 0] step=3940, skipped=37, lr=[3.717765725507888e-07, 3.717765725507888e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:11,970] [INFO] [timer.py:199:stop] epoch=0/micro_step=3940/global_step=3940, RunningAvgSamplesPerSec=233.31529669995987, CurrSamplesPerSec=37.67500895084327, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:14,721] [INFO] [logging.py:96:log_dist] [Rank 0] step=3950, skipped=37, lr=[3.398151960681162e-07, 3.398151960681162e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:14,730] [INFO] [timer.py:199:stop] epoch=0/micro_step=3950/global_step=3950, RunningAvgSamplesPerSec=233.31427402443424, CurrSamplesPerSec=238.56136989369247, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:17,414] [INFO] [logging.py:96:log_dist] [Rank 0] step=3960, skipped=37, lr=[3.0928072069227044e-07, 3.0928072069227044e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:17,423] [INFO] [timer.py:199:stop] epoch=0/micro_step=3960/global_step=3960, RunningAvgSamplesPerSec=233.32633222001914, CurrSamplesPerSec=233.97778201973557, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:20,100] [INFO] [logging.py:96:log_dist] [Rank 0] step=3970, skipped=37, lr=[2.8017491322576973e-07, 2.8017491322576973e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:20,109] [INFO] [timer.py:199:stop] epoch=0/micro_step=3970/global_step=3970, RunningAvgSamplesPerSec=233.33965070949623, CurrSamplesPerSec=239.0706474449805, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:22,791] [INFO] [logging.py:96:log_dist] [Rank 0] step=3980, skipped=37, lr=[2.524994578047751e-07, 2.524994578047751e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:22,800] [INFO] [timer.py:199:stop] epoch=0/micro_step=3980/global_step=3980, RunningAvgSamplesPerSec=233.35198325259722, CurrSamplesPerSec=238.68185350568794, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
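Each timer.py line pairs an instantaneous rate (CurrSamplesPerSec) with a running mean (RunningAvgSamplesPerSec); the occasional dips (e.g. 54.68 or 37.68 samples/s above) mark individual steps stretched by stalls, while the running average barely moves. A minimal sketch of that bookkeeping, with illustrative names rather than DeepSpeed's actual timer:

```python
import time

class ThroughputTimer:
    """Tracks current and running-average samples/sec across steps."""
    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self.total_samples = 0
        self.total_seconds = 0.0
        self._start = 0.0

    def start(self) -> None:
        self._start = time.time()

    def stop(self) -> tuple:
        elapsed = time.time() - self._start
        self.total_samples += self.batch_size
        self.total_seconds += elapsed
        curr = self.batch_size / elapsed               # CurrSamplesPerSec
        avg = self.total_samples / self.total_seconds  # RunningAvgSamplesPerSec
        return curr, avg
```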
[2023-04-14 08:28:25,478] [INFO] [logging.py:96:log_dist] [Rank 0] step=3990, skipped=37, lr=[2.262559558016325e-07, 2.262559558016325e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:25,487] [INFO] [timer.py:199:stop] epoch=0/micro_step=3990/global_step=3990, RunningAvgSamplesPerSec=233.36499857650637, CurrSamplesPerSec=238.4551991735119, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:28,165] [INFO] [logging.py:96:log_dist] [Rank 0] step=4000, skipped=37, lr=[2.0144592573221044e-07, 2.0144592573221044e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:28,174] [INFO] [timer.py:199:stop] epoch=0/micro_step=4000/global_step=4000, RunningAvgSamplesPerSec=233.37779184133674, CurrSamplesPerSec=238.83114181617591, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:30,850] [INFO] [logging.py:96:log_dist] [Rank 0] step=4010, skipped=37, lr=[1.7807080316804835e-07, 1.7807080316804835e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:30,859] [INFO] [timer.py:199:stop] epoch=0/micro_step=4010/global_step=4010, RunningAvgSamplesPerSec=233.39126098396906, CurrSamplesPerSec=239.27842303900508, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:33,535] [INFO] [logging.py:96:log_dist] [Rank 0] step=4020, skipped=37, lr=[1.561319406532785e-07, 1.561319406532785e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:33,544] [INFO] [timer.py:199:stop] epoch=0/micro_step=4020/global_step=4020, RunningAvgSamplesPerSec=233.40435950099132, CurrSamplesPerSec=239.16884301761988, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:36,221] [INFO] [logging.py:96:log_dist] [Rank 0] step=4030, skipped=37, lr=[1.3563060762636637e-07, 1.3563060762636637e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:36,754] [INFO] [timer.py:199:stop] epoch=0/micro_step=4030/global_step=4030, RunningAvgSamplesPerSec=233.30668696939878, CurrSamplesPerSec=80.80150769424341, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:37,003] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 100 iterations
[2023-04-14 08:28:37,003] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 8192.0 to 16384.0
[2023-04-14 08:28:39,430] [INFO] [logging.py:96:log_dist] [Rank 0] step=4040, skipped=37, lr=[1.1656799034666376e-07, 1.1656799034666376e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:39,438] [INFO] [timer.py:199:stop] epoch=0/micro_step=4040/global_step=4040, RunningAvgSamplesPerSec=233.3203011911347, CurrSamplesPerSec=239.63108115225543, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:42,114] [INFO] [logging.py:96:log_dist] [Rank 0] step=4050, skipped=37, lr=[9.894519182576111e-08, 9.894519182576111e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:42,123] [INFO] [timer.py:199:stop] epoch=0/micro_step=4050/global_step=4050, RunningAvgSamplesPerSec=233.3337934590055, CurrSamplesPerSec=238.8506926555995, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:44,809] [INFO] [logging.py:96:log_dist] [Rank 0] step=4060, skipped=37, lr=[8.276323176366896e-08, 8.276323176366896e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:44,819] [INFO] [timer.py:199:stop] epoch=0/micro_step=4060/global_step=4060, RunningAvgSamplesPerSec=233.34488103081313, CurrSamplesPerSec=238.31632402122545, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:47,518] [INFO] [logging.py:96:log_dist] [Rank 0] step=4070, skipped=37, lr=[6.802304648981517e-08, 6.802304648981517e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:47,527] [INFO] [timer.py:199:stop] epoch=0/micro_step=4070/global_step=4070, RunningAvgSamplesPerSec=233.35321674342305, CurrSamplesPerSec=239.15243597681845, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:50,202] [INFO] [logging.py:96:log_dist] [Rank 0] step=4080, skipped=37, lr=[5.472548890886886e-08, 5.472548890886886e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:50,211] [INFO] [timer.py:199:stop] epoch=0/micro_step=4080/global_step=4080, RunningAvgSamplesPerSec=233.36661443192827, CurrSamplesPerSec=238.8226424494771, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:52,890] [INFO] [logging.py:96:log_dist] [Rank 0] step=4090, skipped=37, lr=[4.2871328451377094e-08, 4.2871328451377094e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:52,899] [INFO] [timer.py:199:stop] epoch=0/micro_step=4090/global_step=4090, RunningAvgSamplesPerSec=233.3791961537069, CurrSamplesPerSec=239.04126835427934, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:55,578] [INFO] [logging.py:96:log_dist] [Rank 0] step=4100, skipped=37, lr=[3.246125102926434e-08, 3.246125102926434e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:55,587] [INFO] [timer.py:199:stop] epoch=0/micro_step=4100/global_step=4100, RunningAvgSamplesPerSec=233.39147023066764, CurrSamplesPerSec=238.8205177023234, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:28:58,311] [INFO] [logging.py:96:log_dist] [Rank 0] step=4110, skipped=37, lr=[2.349585899612261e-08, 2.349585899612261e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:28:59,498] [INFO] [timer.py:199:stop] epoch=0/micro_step=4110/global_step=4110, RunningAvgSamplesPerSec=233.15203591314105, CurrSamplesPerSec=43.90343827769477, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:29:02,174] [INFO] [logging.py:96:log_dist] [Rank 0] step=4120, skipped=37, lr=[1.5975671112372636e-08, 1.5975671112372636e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:29:02,183] [INFO] [timer.py:199:stop] epoch=0/micro_step=4120/global_step=4120, RunningAvgSamplesPerSec=233.165541643602, CurrSamplesPerSec=238.83071683347316, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
[2023-04-14 08:29:04,785] [INFO] [logging.py:96:log_dist] [Rank 0] step=4130, skipped=37, lr=[9.901122515240669e-09, 9.901122515240669e-09], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-14 08:29:04,794] [INFO] [timer.py:199:stop] epoch=0/micro_step=4130/global_step=4130, RunningAvgSamplesPerSec=233.19453146248313, CurrSamplesPerSec=336.3578856358818, MemAllocated=4.32GB, MaxMemAllocated=21.24GB
Epoch 1/1 with loss 0.5945421269673123
***** Evaluating reward, Epoch 1/1 *****
chosen_last_scores (higher is better) : 1.6397132873535156, acc (higher is better) : 0.6174242496490479
saving model ...
[2023-04-14 08:29:37,481] [INFO] [launch.py:460:main] Process 351158 exits successfully.
[2023-04-14 08:29:39,032] [INFO] [launch.py:460:main] Process 351159 exits successfully.
[2023-04-14 08:29:39,032] [INFO] [launch.py:460:main] Process 351156 exits successfully.
[2023-04-14 08:29:39,033] [INFO] [launch.py:460:main] Process 351160 exits successfully.
[2023-04-14 08:29:39,033] [INFO] [launch.py:460:main] Process 351161 exits successfully.
[2023-04-14 08:29:39,033] [INFO] [launch.py:460:main] Process 351157 exits successfully.
[2023-04-14 08:29:39,033] [INFO] [launch.py:460:main] Process 351154 exits successfully.
[2023-04-14 08:29:54,049] [INFO] [launch.py:460:main] Process 351153 exits successfully.
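The closing evaluation reports the mean reward the model assigns to the chosen responses (chosen_last_scores) and the pairwise accuracy (acc), i.e. the fraction of comparison pairs where the chosen response outscores the rejected one, about 0.617 here. A hedged sketch of how such pairwise metrics are typically computed for a reward model; the tensor names are illustrative, not the actual evaluation loop:

```python
import torch

def eval_reward(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor):
    """Mean chosen reward and pairwise accuracy over comparison pairs."""
    acc = (chosen_scores > rejected_scores).float().mean().item()
    mean_chosen = chosen_scores.mean().item()
    return mean_chosen, acc
```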