[2024-10-10 13:02:53,220] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: No such file or directory
INFO:root:Using nproc_per_node=8.
W1010 13:03:02.572000 140546484668224 torch/distributed/run.py:779]
W1010 13:03:02.572000 140546484668224 torch/distributed/run.py:779] *****************************************
W1010 13:03:02.572000 140546484668224 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1010 13:03:02.572000 140546484668224 torch/distributed/run.py:779] *****************************************
[2024-10-10 13:03:16,994] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 13:03:17,073] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 13:03:17,075] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 13:03:17,081] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 13:03:17,083] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 13:03:17,091] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 13:03:17,103] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 13:03:17,105] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 13:03:18,744] [INFO] [comm.py:652:init_distributed] cdb=None
2024-10-10 13:03:19 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1 distributed training: True, 16-bits training: False
[2024-10-10 13:03:19,264] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 13:03:19,312] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 13:03:19,312] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-10-10 13:03:19,312] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 13:03:19,330] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 13:03:19,337] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 13:03:19,350] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 13:03:19,360] [INFO] [comm.py:652:init_distributed] cdb=None
2024-10-10 13:03:20 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: False
2024-10-10 13:03:20 - INFO - __main__ - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='/workspace/jihaozhe/models/Llama-3.2-1B', model_revision='main', model_code_revision=None, torch_dtype='bfloat16', tokenizer_name_or_path=None, trust_remote_code=False, attn_implementation='flash_attention_2', use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False, bnb_4bit_quant_storage='uint8')
2024-10-10 13:03:20 - INFO - __main__ - Data parameters DataArguments(chat_template="{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' +
message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}", dataset_mixer={'/workspace/jihaozhe/data/ultrachat_200k': 1.0}, text_column='text', dataset_splits=['train_sft', 'test_sft'], dataset_configs=None, preprocessing_num_workers=12, truncation_side=None, auto_insert_empty_system_msg=True) 2024-10-10 13:03:20 - INFO - __main__ - Training/evaluation parameters SFTConfig( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, batch_eval_metrics=False, bf16=True, bf16_full_eval=False, chars_per_token=, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, dataset_batch_size=1000, dataset_kwargs=None, dataset_num_proc=None, dataset_text_field=None, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, dispatch_batches=None, do_eval=True, do_predict=False, do_train=False, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_packing=None, eval_steps=None, eval_strategy=epoch, eval_use_gather_object=False, evaluation_strategy=None, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=1, gradient_checkpointing=True, gradient_checkpointing_kwargs={'use_reentrant': False}, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=Llama-3.2-1B-sft-full, hub_model_revision=main, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=2e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=info, log_level_replica=warning, log_on_each_node=True, logging_dir=/workspace/jihaozhe/models/Llama-3.2-1B-sft-full/runs/Oct10_13-03-19_node090, logging_first_step=True, logging_nan_inf_filter=True, logging_steps=5, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_seq_length=2048, max_steps=-1, metric_for_best_model=None, model_init_kwargs=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_of_sequences=1024, num_train_epochs=1, optim=adamw_torch, optim_args=None, optim_target_modules=None, output_dir=/workspace/jihaozhe/models/Llama-3.2-1B-sft-full, overwrite_output_dir=True, packing=False, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=16, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], 
restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, run_name=/workspace/jihaozhe/models/Llama-3.2-1B-sft-full, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=100, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, split_batches=None, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger=False, use_mps_device=False, warmup_ratio=0.1, warmup_steps=0, weight_decay=0.0, ) 2024-10-10 13:03:20 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1 distributed training: True, 16-bits training: False 2024-10-10 13:03:20 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1 distributed training: True, 16-bits training: False 2024-10-10 13:03:20 - WARNING - __main__ - Process rank: 6, device: cuda:6, n_gpu: 1 distributed training: True, 16-bits training: False 2024-10-10 13:03:20 - WARNING - __main__ - Process rank: 7, device: cuda:7, n_gpu: 1 distributed training: True, 16-bits training: False 2024-10-10 13:03:20 - WARNING - __main__ - Process rank: 5, device: cuda:5, n_gpu: 1 distributed training: True, 16-bits training: False 2024-10-10 13:03:20 - WARNING - __main__ - Process rank: 4, device: cuda:4, n_gpu: 1 distributed training: True, 16-bits training: False Generating dataset ultrachat_200k (/root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7) 2024-10-10 13:03:21 - INFO - datasets.builder - Generating dataset ultrachat_200k (/root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7) Downloading and preparing dataset ultrachat_200k/default to /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7... 2024-10-10 13:03:21 - INFO - datasets.builder - Downloading and preparing dataset ultrachat_200k/default to /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7... Downloading took 0.0 min 2024-10-10 13:03:21 - INFO - datasets.download.download_manager - Downloading took 0.0 min Checksum Computation took 0.0 min 2024-10-10 13:03:21 - INFO - datasets.download.download_manager - Checksum Computation took 0.0 min Generating train_sft split 2024-10-10 13:03:21 - INFO - datasets.builder - Generating train_sft split Generating train_sft split: 0%| | 0/207865 [00:00> loading file tokenizer.json [INFO|tokenization_utils_base.py:2267] 2024-10-10 13:03:31,502 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2267] 2024-10-10 13:03:31,502 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2267] 2024-10-10 13:03:31,502 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2513] 2024-10-10 13:03:31,821 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
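(The "Applying chat template" step that follows formats each UltraChat conversation into the plain-text `text` column using the Jinja template from DataArguments above. As a rough illustration only, not the handbook's exact preprocessing helper, the same rendering can be reproduced with `tokenizer.apply_chat_template`; the example conversation below is a made-up stand-in, and the printed output has the `<|system|> / <|user|> / <|assistant|>` layout visible in the sample dumps later in this log, since Llama-3.2-1B's `eos_token` is `<|end_of_text|>`.)

```python
# Sketch: reproduce the "Applying chat template" formatting for one example.
from transformers import AutoTokenizer

MODEL_PATH = "/workspace/jihaozhe/models/Llama-3.2-1B"  # path taken from the log above

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
# Chat template as logged in DataArguments (the "\n" escapes become real newlines here).
tokenizer.chat_template = (
    "{% for message in messages %}\n"
    "{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n"
    "{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n"
    "{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n"
    "{% endif %}\n"
    "{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n"
    "{% endfor %}"
)

# Illustrative conversation; with auto_insert_empty_system_msg=True the handbook
# prepends an empty system turn, which shows up as "<|system|>\n<|end_of_text|>".
messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "How can we encourage more public engagement in conservation efforts?"},
    {"role": "assistant", "content": "There are several ways to encourage public engagement..."},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
print(text)  # -> the "text" field used as dataset_text_field for SFT
```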
2024-10-10 13:03:31 - INFO - __main__ - *** Load pretrained model *** Process #0 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00000_of_00012.arrow 2024-10-10 13:03:31 - INFO - datasets.arrow_dataset - Process #0 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00000_of_00012.arrow Process #1 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00001_of_00012.arrow 2024-10-10 13:03:31 - INFO - datasets.arrow_dataset - Process #1 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00001_of_00012.arrow Process #2 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00002_of_00012.arrow 2024-10-10 13:03:31 - INFO - datasets.arrow_dataset - Process #2 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00002_of_00012.arrow Process #3 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00003_of_00012.arrow 2024-10-10 13:03:31 - INFO - datasets.arrow_dataset - Process #3 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00003_of_00012.arrow Process #4 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00004_of_00012.arrow 2024-10-10 13:03:31 - INFO - datasets.arrow_dataset - Process #4 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00004_of_00012.arrow Process #5 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00005_of_00012.arrow 2024-10-10 13:03:31 - INFO - datasets.arrow_dataset - Process #5 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00005_of_00012.arrow Process #6 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00006_of_00012.arrow 2024-10-10 13:03:31 - INFO - datasets.arrow_dataset - Process #6 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00006_of_00012.arrow Process #7 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00007_of_00012.arrow 2024-10-10 13:03:31 - INFO - datasets.arrow_dataset - Process #7 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00007_of_00012.arrow Process #8 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00008_of_00012.arrow 2024-10-10 13:03:31 - INFO - datasets.arrow_dataset - Process #8 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00008_of_00012.arrow Process #9 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00009_of_00012.arrow 2024-10-10 13:03:31 - INFO - datasets.arrow_dataset - Process #9 will write at 
/root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00009_of_00012.arrow Process #10 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00010_of_00012.arrow 2024-10-10 13:03:31 - INFO - datasets.arrow_dataset - Process #10 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00010_of_00012.arrow Process #11 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00011_of_00012.arrow 2024-10-10 13:03:31 - INFO - datasets.arrow_dataset - Process #11 will write at /root/.cache/huggingface/datasets/ultrachat_200k/default/0.0.0/bf0771b442175ec7/cache-4a082cf2cc72f67e_00011_of_00012.arrow Spawning 12 processes 2024-10-10 13:03:32 - INFO - datasets.arrow_dataset - Spawning 12 processes Applying chat template (num_proc=12): 0%| | 0/207865 [00:00 <|end_of_text|> <|user|> Describe in detail how beauty consultants use augmented reality technology to enhance the customer experience and provide personalized recommendations. Include an explanation of the different types of augmented reality used, such as virtual try-on, facial recognition, and skin analysis. Additionally, discuss how these technologies are integrated into the consultant's workflow and how they help to improve customer satisfaction and sales. Provide examples of beauty brands that have successfully implemented augmented reality in their consultations and their outcomes.<|end_of_text|> <|assistant|> Beauty consultants use augmented reality technology to help customers visualize how different products will look on them before making a purchase. Augmented reality is a technology that integrates digital content into the user's physical environment, enhancing the real-world experience. Augmented reality is utilized in the beauty industry through apps, mirrors, and specialized devices. One type of augmented reality used by beauty consultants is virtual try-on. Virtual try-on allows customers to try on different makeup looks, hairstyles, and even accessories virtually. The technology uses a camera to scan the customer's face, and then it maps virtual makeup or hair onto the customer's face, giving a realistic idea of how the product will look before making a purchase. This technology is especially helpful for people who are hesitant to try new products or have sensitive skin and can't try on products in-store. Another type of augmented reality used by beauty consultants is facial recognition. Facial recognition technology analyzes the customer's face and identifies their unique features, such as skin tone and facial shape. By analyzing these features, the system can provide personalized product recommendations that would work best for the customer's skin type and complexion. This technology has become popular in skincare consultations, as it provides the customer with a tailored skincare regimen that addresses their unique needs. Finally, augmented reality is also used in skin analysis. This technology allows beauty consultants to assess skin conditions such as wrinkles, sun damage, and acne. By analyzing the customer's skin, beauty consultants can recommend personalized skincare products. To integrate augmented reality technology into their workflow, beauty consultants use specialized apps or devices, such as handheld cameras or mirrors equipped with augmented reality capabilities. 
Customers can also access augmented reality technology in-store through interactive kiosks or mobile apps. Implementing augmented reality technology in customer consultations has many benefits. This technology helps enhance the customer's experience, as it allows them to visualize how the product will look before making a purchase. It also provides customers with personalized recommendations, which boosts customer satisfaction and can lead to an increase in sales. Several beauty brands have successfully implemented augmented reality in their consultations. For example, L'Oreal's Makeup Genius app allows customers to virtually try on different makeup looks, and Sephora's Virtual Artist app uses facial recognition technology to offer personalized recommendations. Both apps have received positive reviews, with customers appreciating the convenience and personalized recommendations. In conclusion, augmented reality technology is becoming increasingly popular in the beauty industry. Virtual try-on, facial recognition, and skin analysis technology all help enhance the customer experience, provide tailored recommendations and improve customer satisfaction. Beauty brands that have successfully implemented augmented reality technology in their consultations have seen an increase in sales and positive customer feedback.<|end_of_text|> <|user|> Can you elaborate more on how augmented reality technology is used in virtual try-ons and how it has improved the customer experience for beauty brands? Also, can you recommend any specific beauty brands that have seen a significant increase in sales since implementing augmented reality technology in their consultations?<|end_of_text|> <|assistant|> Certainly! Augmented reality technology in virtual try-ons allows customers to see how different makeup and hair products look on their face or head without physically trying them on. By using a tool such as a mobile app or camera, customers can see realistic 3D renderings of how the products will look on them, without having to apply or remove any physical makeup or hair products. This makes it much easier for customers to visualize if a particular product will work for them or not before they make a purchase. Augmented reality in virtual try-ons has drastically improved the customer experience for beauty brands because it helps customers make more confident, informed purchasing decisions. Customers can experiment with different makeup looks or hairstyles virtually, without the risk of wasting money on a product that might not work for them. This technology has also helped to create a more immersive and exciting experience for customers, which can lead to stronger loyalty and increased brand perception. As for brands that have seen a significant increase in sales since implementing augmented reality technology in their consultations, one great example is L'Oreal's Makeup Genius app. Since launching in 2014, the app has been downloaded over 20 million times and has helped to increase L'Oreal's online sales. Additionally, Estée Lauder saw a 65% increase in sales of their Double Wear foundation after launching an augmented reality try-on tool in their mobile app. Finally, Sephora's Virtual Artist app has also seen success with over 10 million downloads since launching in 2016. Overall, augmented reality technology in virtual try-ons has drastically improved the customer experience for beauty brands by allowing customers to make more informed decisions and create an immersive shopping experience. 
As more brands begin to implement this technology, it's likely we will continue to see significant increases in sales and stronger customer loyalty as a result.<|end_of_text|> <|user|> Can you offer more information about how augmented reality technology is used in skincare consultations and recommend some specific beauty brands that have successfully integrated this technology in their consultations? Additionally, can you explain further how beauty brands can use augmented reality technology to enhance customer satisfaction and increase sales?<|end_of_text|> <|assistant|> Certainly! Augmented reality technology is used in skincare consultations by helping beauty consultants analyze the customer's skin and offer personalized recommendations based on their specific needs. This is primarily done by using facial recognition technology to identify the customer's unique features, including skin tone, texture, and any areas of concern. The consultant can then use augmented reality technology to show the customer how a particular product will help address those issues, and even track the effectiveness of the product over time. A great example of a beauty brand that has successfully integrated this technology into their consultations is Dermalogica. Their Face Mapping technology uses facial recognition technology to analyze the customer's skin and offer personalized product recommendations based on their skin type and concerns. Dermalogica also recently launched Skin Analysis 2.0, which uses machine learning to analyze customer's skin in even more detail, to provide more targeted recommendations. As for how augmented reality technology can enhance customer satisfaction and increase sales, there are several key ways. First, it helps customers make more informed purchasing decisions by providing personalized recommendations based on their individual needs, which makes them feel more confident in their choices. Second, it creates a more immersive and exciting shopping experience, which can help to improve customer retention and loyalty. Finally, it can help to simplify and streamline the consultation process, which can increase efficiency and allow consultants to assist more customers. One great example of a beauty brand that has successfully used augmented reality technology to enhance customer satisfaction and increase sales is MAC Cosmetics. They have a virtual try-on tool that allows customers to try on different eyeshadow shades and save their favorite looks to easily purchase later. The tool has been incredibly popular, contributing to significant sales growth for the brand. Overall, augmented reality technology has great potential in the beauty industry to enhance the customer experience, increase sales, and improve efficiency. Beauty brands that successfully integrate this technology into their consultations will likely continue to see strong sales growth and increased customer satisfaction.<|end_of_text|> 2024-10-10 13:03:40 - INFO - __main__ - Sample 29184 of the processed training set: <|system|> <|end_of_text|> <|user|> How can we encourage more public engagement in conservation efforts and create a sense of responsibility towards protecting our natural environment?<|end_of_text|> <|assistant|> There are several ways to encourage public engagement in conservation efforts and create a sense of responsibility towards protecting our natural environment: 1. 
Raise awareness: Conduct public awareness campaigns through various mediums like social media, television, and radio to educate people about the importance of conservation and how they can contribute to it. 2. Community involvement: Encourage individuals and local communities to come together for conservation projects like cleaning up local parks and beaches. 3. Volunteering: Organize volunteering opportunities for people to participate in conservation efforts by providing them with the necessary tools and resources. 4. Education: Promote eco-friendly lifestyles by teaching children about the environment in schools and encouraging adults to attend workshops and seminars. 5. Collaborate with local organizations: Collaborate with local environmental organizations to create sustainable programs that will benefit the community and the environment. 6. Rewards and incentives: Offer incentives and rewards for people who take part in conservation efforts to motivate them to continue their efforts. 7. Government initiatives: Encourage the government to introduce policies that support conservation efforts to protect natural habitats, wildlife, and the environment. By using these strategies, we can encourage public engagement in conservation efforts and create a sense of responsibility towards protecting our natural environment.<|end_of_text|> <|user|> I really like the idea of offering incentives for people who participate in conservation efforts. What kind of incentives do you think would be effective?<|end_of_text|> <|assistant|> There are several incentives that could be effective in encouraging people to participate in conservation efforts. Some examples are: 1. Recognition: Acknowledge and celebrate the efforts of people who participate in conservation efforts, for example, by awarding certificates, hosting events, or mentioning them on social media. 2. Discounts: Offer discounts or coupons for local businesses to those who participate in conservation efforts. This could encourage people to not only participate in conservation but also support smaller businesses in their community. 3. Prizes: Offer the chance to win prizes like books on environmentalism, eco-friendly products, or even eco-tourism trips to those who participate in conservation efforts. 4. Volunteer hour credits: Give credits to volunteers for their hours worked, giving them a tangible benefit while helping to track their contributions to the environment. 5. Free admission: Offer free admission to parks, zoos, aquariums, and other natural attractions for those who participate in conservation efforts. 6. Tax incentives: Offer tax benefits to individuals or businesses who participate in conservation efforts, giving them another reason to get involved. These incentives could be tailored to suit the interests and needs of the local community and may be more or less effective depending on the target audience.<|end_of_text|> <|user|> I think that offering free admission to natural attractions for those who participate in conservation efforts is a great idea! It's a win-win situation for everyone.<|end_of_text|> <|assistant|> Yes, offering free admission to natural attractions is a great incentive for those who participate in conservation efforts because it not only encourages people to get involved in conservation work but also gives them the opportunity to enjoy and appreciate the beauty of nature. Additionally, it benefits the attractions themselves as they receive more visitors as a result of the incentive. 
This can be a great way to promote ecotourism and raise awareness about conservation efforts. By doing this, we can show people how important it is to protect our natural environment and that they can make a difference.<|end_of_text|> <|user|> I also think it would be great if businesses offered discounts or coupons for those who participate in conservation efforts. It's a good way to support local businesses while promoting conservation.<|end_of_text|> <|assistant|> Yes, offering discounts or coupons for local businesses is a great way to support the local economy while also promoting conservation. By encouraging people to participate in conservation efforts, local businesses can attract new customers and show their support for important environmental causes. This is an excellent way to strengthen the connection between businesses and the community while also fostering a sense of responsibility towards environmental preservation. This initiative can also lead to a greater sense of community pride and engagement in the local area as people feel that they are making a positive impact on their own community.<|end_of_text|> <|user|> I also think that organizing workshops and seminars for adults could be really effective. I know I personally would love to learn more about how I can make small changes in my daily life to help protect the environment.<|end_of_text|> <|assistant|> Yes, organizing workshops and seminars for adults can be very effective in promoting awareness and education about conservation efforts. These events can provide valuable information on ways individuals can make small changes in their daily lives to help preserve the environment. The workshops and seminars can cover topics such as reducing waste, composting, energy conservation, and sustainable living practices. These events can also provide an excellent opportunity for people to meet and network with like-minded individuals who share a passion for environmental causes. In addition, inviting guest speakers who are experts on various environmental topics can offer more in-depth knowledge and information to attendees. This will provide attendees with a chance to learn and ask questions and encourage them to be more engaged in conservation efforts. Overall, organizing workshops and seminars is a great way to educate adults and empower them to take an active role in protecting the environment.<|end_of_text|> <|user|> I also think it would be great to promote sustainable living practices in schools so that the younger generation can learn to be mindful of the environment from a young age.<|end_of_text|> <|assistant|> Yes, promoting sustainable living practices in schools is a great way to teach young people about the importance of conservation and environmental protection. Educating children about the environment from a young age can help cultivate a sense of responsibility and commitment towards conservation efforts. It can also encourage them to develop environmentally conscious habits and practices that they can carry with them throughout life. By teaching sustainable living practices such as composting, recycling, conserving energy, and reducing waste, we can prepare the next generation to be mindful of their impact on the environment. Furthermore, schools can organize programs that focus on outdoor education and nature-based learning. These initiatives can help children develop a deeper appreciation and understanding of the natural world. 
They can also provide children with valuable opportunities to participate in conservation efforts, such as planting trees, cleaning up local parks, and monitoring local wildlife. By introducing conservation education in schools, we can ensure that future generations are equipped with the knowledge and tools necessary to protect the environment.<|end_of_text|> <|user|> I also think it would be great to have more eco-friendly options available in stores to make it easier for people to live sustainably. Do you have any ideas for how stores can promote eco-friendly options?<|end_of_text|> <|assistant|> Yes, having more eco-friendly options available in stores is a great way to make sustainable living more accessible for everyone. Here are some ideas for how stores can promote eco-friendly options: 1. Display eco-friendly products prominently: Stores can display eco-friendly products prominently in high-traffic areas and near checkout counters to increase their visibility and encourage people to consider purchasing them. 2. Provide information on the environmental benefits: Stores can provide information to shoppers on the environmental benefits of the eco-friendly products they carry. This can be done through product labeling, in-store signage, or by providing educational materials. 3. Offer discounts and promotions: Stores can offer discounts and promotions for eco-friendly products to incentivize people to purchase them over less environmentally-friendly alternatives. Discounts can also be given to shoppers who bring their own bags, reducing the amount of plastic bag waste. 4. Collaborate with environmental organizations: Stores can collaborate with local environmental organizations to promote eco-friendly living and host events related to sustainability. This could include organizing educational campaigns on recycling at outdoor markets, for example. 5. Use more eco-friendly packaging: Stores can switch to more eco-friendly packaging options, such as biodegradable bags and boxes or encourage customers to bring their own reusable bags or containers. By offering more eco-friendly options and promoting sustainable products, stores can help make it easier for people to live sustainably and make a positive impact on the environment.<|end_of_text|> 2024-10-10 13:03:40 - INFO - __main__ - Sample 6556 of the processed training set: <|system|> <|end_of_text|> <|user|> Write a coming-of-age play that explores the challenges and realities of growing up with a group of friends as the central characters. The play should delve into themes of identity, peer pressure, family dynamics, and the transition from adolescence to adulthood. The style of the play can be realistic, absurdist, or a blend of both, and should include both comedic and dramatic elements. The characters should be multi-dimensional and relatable, and the plot should have a clear arc that builds to a satisfying resolution.<|end_of_text|> <|assistant|> Title: Growing Up Together Act 1 Scene 1: A group of friends, Lucy, Jake, Hannah, and Mike, sit on a bench outside a convenience store. They’re all 16 years old and in high school. They talk about random things that only teenagers talk about, like school, crushes, and the latest video games. Scene 2: At Hannah’s house – she lives with her dad and sister. Her dad is always working, and her sister is too busy with her own life. Hannah shares with her friends how tough it is to live with a dad that never has time for her or her sister. 
Scene 3: At Mike’s house – he lives with both his parents, but they’re always arguing. Mike tells his friends how the constant bickering affects him, and he compares his family to divorce, wishing his parents would separate to be happier. Act 2 Scene 1: Lucy’s best friend from childhood, Jess, moves back to town. Lucy feels the pressure to be with her friend again, but worries about losing her current group of friends. Scene 2: Jake and Mike start drinking, smoking, and partying. They try to get the girls involved, but they all decline, causing a rift in the group. Scene 3: Hannah’s dad gets into a severe accident that leaves him unable to work. She has to step up and become the caretaker for both her dad and sister. Act 3 Scene 1: The friends start to argue and drift away from one another. Lucy feels torn between Jess and her current friends, while Jake and Mike become too focused on their vices. Scene 2: Hannah has to face adult responsibilities as the caretaker of her family. She starts to distance herself from her friends, wanting to make her family her top priority. Scene 3: The friends come together for one last dinner and realize that they’ve all changed, grown-up, and become different people. They all leave to start their new chapters with a newfound appreciation for each other. Epilogue Each of the four friends narrates their new journey, telling the audience what they decided to do, where they are heading, and what they hope to achieve in the future as an adult. Despite the hardships they faced, they know they will always be there for each other.<|end_of_text|> <|user|> I like the idea of the play, but can we delve deeper into the family dynamics of the characters? I think that would add an interesting layer to the story.<|end_of_text|> <|assistant|> Absolutely! I agree that exploring the family dynamics of the characters would add depth to the story. Act 1 Scene 1: A group of friends, Lucy, Jake, Hannah, and Mike, sit on a bench outside a convenience store. They’re all 16 years old and in high school. They talk about random things that only teenagers talk about, like school, crushes, and the latest video games. Scene 2: At Hannah’s house – she lives with her dad and sister. Her dad is always working, and her sister is too busy with her own life. Hannah shares with her friends how tough it is to live with a dad that never has time for her or her sister. She takes on most of the household responsibilities, which leaves her feeling overwhelmed and lonely. Scene 3: At Mike’s house – he lives with both his parents, but they're always arguing. His dad has a drinking problem that affects his relationships with the rest of the family. Mike tells his friends how the constant bickering affects him, and he compares his family to divorce, wishing his parents would separate to be happier. Act 2 Scene 1: Lucy’s best friend from childhood, Jess, moves back to town. Lucy feels the pressure to be with her friend again, but worries about losing her current group of friends. Her family is supportive, but her mother is overbearing and has a tendency to push her towards certain career paths. Scene 2: Jake’s family recently moved to the area, and he struggles to fit in with the group. His parents are supportive but have high expectations for him, which leads to tension between them. Scene 3: Hannah’s dad gets into a severe accident that leaves him unable to work. She has to step up and become the caretaker for both her dad and sister. 
Her friends are supportive, but her family has a tendency to view her as the "responsible" one and put more pressure on her. Act 3 Scene 1: The friends start to argue and drift away from one another. Lucy feels torn between Jess and her current friends, while Jake and Mike become too focused on their vices. Hannah struggles to maintain her family responsibilities and her friendships. Scene 2: Hannah has to face adult responsibilities as the caretaker of her family. She starts to distance herself from her friends, wanting to make her family her top priority. She feels guilty for neglecting her friends but finds it challenging to balance everything. Scene 3: The friends come together for one last dinner and realize that they’ve all changed, grown-up, and become different people. They all leave to start their new chapters with a newfound appreciation for each other and their families. Epilogue Each of the four friends narrates their new journey, telling the audience what they decided to do, where they are heading, and what they hope to achieve in the future as an adult. Despite the hardships they faced, they know they will always be there for each other and their families.<|end_of_text|> <|user|> Hey, the family dynamics added a lot of depth to the play, but I think we could explore it even further. Can you add some scenes where the characters struggle to balance their family responsibilities and their friendships? Also, can we see how their family dynamics affect their personal growth and identity?<|end_of_text|> <|assistant|> Absolutely! Here's a revised version that delves even deeper into the family dynamics of the characters, explores their struggle to balance family responsibilities and friendships, and shows how their family dynamics affect their personal growth and identity: Act 1 Scene 1: A group of friends, Lucy, Jake, Hannah, and Mike, sit on a bench outside a convenience store. They’re all 16 years old and in high school. They talk about random things that only teenagers talk about, like school, crushes, and the latest video games. Scene 2: At Hannah’s house – she lives with her dad and sister. Her dad is always working, and her sister is too busy with her own life. Hannah takes on most of the household responsibilities, which leaves her feeling overwhelmed and lonely. She shares with her friends how tough it is to balance her family responsibility and her friendships. Scene 3: At Mike's house – he lives with both his parents, but they're always arguing. His dad has a drinking problem that affects his relationships with the rest of the family. Mike tells his friends how the constant bickering affects him, and he compares his family to divorce, wishing his parents would at least try to be happy together. This struggle with his parents' relationship leaves him feeling lost and unsure of his place in the world. Act 2 Scene 1: Lucy’s best friend from childhood, Jess, moves back to town. Lucy feels the pressure to be with her friend again, but worries about losing her current group of friends. Her family is supportive, but her mother is overbearing and has a tendency to push her towards certain career paths. This makes Lucy question who she really is and what she wants to do with her future. Scene 2: Jake’s family recently moved to the area, and he struggles to fit in with the group. His parents have high expectations for him academically and athletically which leads to tension between them. 
Jake feels like he can never live up to his parents' expectations and starts to feel like he's not good enough. Scene 3: Hannah’s dad gets into a severe accident that leaves him unable to work. She has to step up and become the caretaker for both her dad and sister. Her friends are supportive, but balancing her family responsibilities and her friendships becomes even harder. She starts to feel like she's lost her own identity and has become nothing more than a caretaker. Act 3 Scene 1: The friends start to argue and drift away from one another. Lucy feels torn between Jess and her current friends, while Jake and Mike become too focused on their vices. Hannah struggles to maintain her family responsibilities and her friendships. All of them begin to wonder if it's worth it to keep holding on to their friendships or if it's time to let go. Scene 2: After some time has passed, each character has a moment where they realize the importance of their family and their friendships. Even though it's hard to balance both, they know they can't just let go of the people who have been with them through everything. Lucy stands up to her mother and decides to pursue her own dreams, Jake learns to take pride in who he is and not just what his parents expect him to be, Mike starts to confront his own problems within his family, and Hannah learns to ask for help so she can find a balance between caring for her family and herself. Scene 3: The friends come together for one last dinner and realize that they’ve all changed, grown-up, and become different people. They all leave to start their new chapters with a newfound appreciation for each other and their families. They know that it's not going to be easy, but they're ready to face whatever hurdles come their way. Epilogue: Each of the four friends narrates their new journey, telling the audience what they decided to do, where they are heading, and what they hope to achieve in the future as an adult. Despite the hardships they faced, they know they will always be there for each other and their families. They've learned that growing up isn't about finding yourself, but rather creating yourself and sometimes that means taking a step back to care for those who have always been there.<|end_of_text|> <|user|> Hey, I really like the additions you made to the play! Can we add a scene where the characters have a heart-to-heart conversation with their families about their struggles with balancing family responsibilities and friendships? I think it would be a great way to show how important communication is in maintaining relationships.<|end_of_text|> <|assistant|> Definitely! Here's a revised version of the play with the added scene: Act 1 Scene 1: A group of friends, Lucy, Jake, Hannah, and Mike, sit on a bench outside a convenience store. They’re all 16 years old and in high school. They talk about random things that only teenagers talk about, like school, crushes, and the latest video games. Scene 2: At Hannah’s house – she lives with her dad and sister. Her dad is always working, and her sister is too busy with her own life. Hannah takes on most of the household responsibilities, which leaves her feeling overwhelmed and lonely. She shares with her friends how tough it is to balance her family responsibility and her friendships. Scene 3: At Mike's house – he lives with both his parents, but they're always arguing. His dad has a drinking problem that affects his relationships with the rest of the family. 
Mike tells his friends how the constant bickering affects him, and he compares his family to divorce, wishing his parents would try to be happy together. This struggle with his parents' relationship leaves him feeling lost and unsure of his place in the world. Act 2 Scene 1: Lucy’s best friend from childhood, Jess, moves back to town. Lucy feels the pressure to be with her friend again, but worries about losing her current group of friends. Her family is supportive, but her mother is overbearing and has a tendency to push her towards certain career paths. This makes Lucy question who she really is and what she wants to do with her future. Scene 2: Jake’s family recently moved to the area, and he struggles to fit in with the group. His parents have high expectations for him academically and athletically, which leads to tension between them. Jake feels like he can never live up to his parents' expectations and starts to feel like he's not good enough. Scene 3: Hannah’s dad gets into a severe accident that leaves him unable to work. She has to step up and become the caretaker for both her dad and sister. Her friends are supportive, but balancing her family responsibilities and her friendships becomes even harder. She starts to feel like she's lost her own identity and has become nothing more than a caretaker. Act 3 Scene 1: The friends start to argue and drift away from one another. Lucy feels torn between Jess and her current friends, while Jake and Mike become too focused on their vices. Hannah struggles to maintain her family responsibilities and her friendships. All of them begin to wonder if it's worth it to keep holding on to their friendships or if it's time to let go. Scene 2: The friends have a heart-to-heart conversation with their families about their struggles with balancing family responsibilities and friendships. Lucy talks to her mom about her desire to pursue her own dreams, Jake and his parents come to a mutual understanding about the pressure that has been placed upon him, Mike tells his parents how their constant arguing is affecting him, and Hannah's family offers to lend a helping hand. Through this conversation, the characters learn the importance of communication and the power of support from their families. Scene 3: After some time has passed, each character has a moment where they realize the importance of their family and their friendships. Even though it's hard to balance both, they know they can't just let go of the people who have been with them through everything. Lucy stands up for herself and decides to pursue her own dreams, Jake learns to take pride in who he is and not just what his parents expect him to be, Mike starts to confront his own problems within his family, and Hannah learns to ask for help so she can find a balance between caring for her family and herself. Epilogue: Each of the four friends narrates their new journey, telling the audience what they decided to do, where they are heading, and what they hope to achieve in the future as an adult. Despite the hardships they faced, they know they will always be there for each other and their families. 
They've learned that growing up isn't about finding yourself, but rather creating yourself, and sometimes that means taking a step back to care for those who have always been there.<|end_of_text|>
Applying chat template (num_proc=12): 100%|██████████| 23110/23110 [00:02<00:00, 9991.77 examples/s]
Applying chat template (num_proc=12): 100%|██████████| 23110/23110 [00:02<00:00, 10043.55 examples/s]
Applying chat template (num_proc=12): 100%|██████████| 23110/23110 [00:02<00:00, 10044.11 examples/s]
Applying chat template (num_proc=12): 100%|██████████| 23110/23110 [00:02<00:00, 10258.28 examples/s]
Applying chat template (num_proc=12): 80%|████████ | 18585/23110 [00:02<00:00, 14085.57 examples/s]
Applying chat template (num_proc=12): 100%|██████████| 23110/23110 [00:02<00:00, 13381.57 examples/s]
Applying chat template (num_proc=12): 100%|██████████| 23110/23110 [00:02<00:00, 9864.26 examples/s]
NCCL version 2.20.5+cuda12.4
Applying chat template (num_proc=12): 88%|████████▊ | 20344/23110 [00:02<00:00, 14047.76 examples/s]
Applying chat template (num_proc=12): 100%|██████████| 23110/23110 [00:02<00:00, 9655.81 examples/s]
Applying chat template (num_proc=12): 96%|█████████▌| 22142/23110 [00:02<00:00, 13655.16 examples/s]
Applying chat template (num_proc=12): 100%|██████████| 23110/23110 [00:02<00:00, 8584.48 examples/s]
/workspace/jihaozhe/miniconda3/envs/handbook/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': model_init_kwargs, dataset_text_field, max_seq_length, packing. Will not be supported from version '1.0.0'. Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
warnings.warn(message, FutureWarning)
/workspace/jihaozhe/miniconda3/envs/handbook/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:158: UserWarning: You passed `model_init_kwargs` to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(
/workspace/jihaozhe/miniconda3/envs/handbook/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:185: UserWarning: You passed a model_id to the SFTTrainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
warnings.warn( [INFO|configuration_utils.py:731] 2024-10-10 13:03:46,631 >> loading configuration file /workspace/jihaozhe/models/Llama-3.2-1B/config.json
[INFO|configuration_utils.py:800] 2024-10-10 13:03:46,632 >> Model config LlamaConfig { "_name_or_path": "/workspace/jihaozhe/models/Llama-3.2-1B", "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128001, "head_dim": 64, "hidden_act": "silu", "hidden_size": 2048, "initializer_range": 0.02, "intermediate_size": 8192, "max_position_embeddings": 131072, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 16, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": { "factor": 32.0, "high_freq_factor": 4.0, "low_freq_factor": 1.0, "original_max_position_embeddings": 8192, "rope_type": "llama3" }, "rope_theta": 500000.0, "tie_word_embeddings": true, "torch_dtype": "bfloat16", "transformers_version": "4.44.2", "use_cache": false, "vocab_size": 128256 }
[INFO|modeling_utils.py:3675] 2024-10-10 13:03:46,642 >> loading weights file /workspace/jihaozhe/models/Llama-3.2-1B/model.safetensors
[INFO|modeling_utils.py:1606] 2024-10-10 13:03:47,094 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|modeling_utils.py:3820] 2024-10-10 13:03:47,094 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[2024-10-10 13:03:47,094] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8 (printed once by each of the 8 ranks)
[WARNING|logging.py:328] 2024-10-10 13:03:47,102 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. (printed once by each of the 8 ranks)
[INFO|configuration_utils.py:1038] 2024-10-10 13:03:47,112 >> Generate config GenerationConfig { "bos_token_id": 128000, "eos_token_id": 128001, "use_cache": false }
[2024-10-10 13:03:47,694] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 147, num_elems = 1.50B
/workspace/jihaozhe/miniconda3/envs/handbook/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:195: UserWarning: You passed a `packing` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`. warnings.warn(
/workspace/jihaozhe/miniconda3/envs/handbook/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:283: UserWarning: You passed a `max_seq_length` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`. warnings.warn(
/workspace/jihaozhe/miniconda3/envs/handbook/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:321: UserWarning: You passed a `dataset_text_field` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`. warnings.warn(
(the three UserWarnings above are likewise emitted once per rank)
[INFO|modeling_utils.py:4507] 2024-10-10 13:03:52,494 >> All model checkpoint weights were used when initializing LlamaForCausalLM.
[INFO|modeling_utils.py:4515] 2024-10-10 13:03:52,494 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /workspace/jihaozhe/models/Llama-3.2-1B. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
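Under the hood, the model_id plus model_init_kwargs resolve to roughly the from_pretrained call sketched below. Because ZeRO-3 activates zero.init(), the weights are materialized already sharded across the 8 ranks (the reported num_elems = 1.50B is consistent with the 1.236B weights plus the tied embedding matrix being counted a second time), and the Flash Attention "model not initialized on GPU" warning is the usual, generally harmless side effect of building the model on CPU before the trainer places the shards on the GPUs. A hedged, stand-alone sketch (requires the flash-attn package for the attn_implementation option):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/workspace/jihaozhe/models/Llama-3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(model_path)

# bf16 weights + FlashAttention-2 kernels, matching the dtype and attention
# implementation reported in the log above; use_cache=False mirrors the printed config.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    use_cache=False,
)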
[INFO|configuration_utils.py:991] 2024-10-10 13:03:52,497 >> loading configuration file /workspace/jihaozhe/models/Llama-3.2-1B/generation_config.json
[INFO|configuration_utils.py:1038] 2024-10-10 13:03:52,497 >> Generate config GenerationConfig { "bos_token_id": 128000, "do_sample": true, "eos_token_id": 128001, "temperature": 0.6, "top_p": 0.9 }
2024-10-10 13:03:52 - INFO - datasets.builder - Using custom data configuration default-2bbb001ed7b41534
2024-10-10 13:03:52 - INFO - datasets.info - Loading Dataset Infos from /workspace/jihaozhe/miniconda3/envs/handbook/lib/python3.10/site-packages/datasets/packaged_modules/generator
2024-10-10 13:03:52 - INFO - datasets.builder - Generating dataset generator (/root/.cache/huggingface/datasets/generator/default-2bbb001ed7b41534/0.0.0)
2024-10-10 13:03:52 - INFO - datasets.builder - Downloading and preparing dataset generator/default to /root/.cache/huggingface/datasets/generator/default-2bbb001ed7b41534/0.0.0...
2024-10-10 13:03:52 - INFO - datasets.builder - Generating train split
Generating train split: 0 examples [00:00, ? examples/s][WARNING|tokenization_utils_base.py:4101] 2024-10-10 13:03:55,201 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2371 > 2048).
Running this sequence through the model will result in indexing errors
Generating train split: 121635 examples [07:21, 275.20 examples/s] (intermediate progress updates omitted)
2024-10-10 13:11:14 - INFO - datasets.utils.info_utils - Unable to verify splits sizes.
2024-10-10 13:11:14 - INFO - datasets.builder - Dataset generator downloaded and prepared to /root/.cache/huggingface/datasets/generator/default-2bbb001ed7b41534/0.0.0. Subsequent calls will reuse this data.
2024-10-10 13:11:14 - INFO - datasets.builder - Using custom data configuration default-b6f2d3b37c5ee650
2024-10-10 13:11:14 - INFO - datasets.info - Loading Dataset Infos from /workspace/jihaozhe/miniconda3/envs/handbook/lib/python3.10/site-packages/datasets/packaged_modules/generator
2024-10-10 13:11:14 - INFO - datasets.builder - Generating dataset generator (/root/.cache/huggingface/datasets/generator/default-b6f2d3b37c5ee650/0.0.0)
2024-10-10 13:11:14 - INFO - datasets.builder - Downloading and preparing dataset generator/default to /root/.cache/huggingface/datasets/generator/default-b6f2d3b37c5ee650/0.0.0...
2024-10-10 13:11:14 - INFO - datasets.builder - Generating train split
Generating train split: 13458 examples [00:48, 277.59 examples/s] (intermediate progress updates omitted)
2024-10-10 13:12:03 - INFO - datasets.utils.info_utils - Unable to verify splits sizes.
2024-10-10 13:12:03 - INFO - datasets.builder - Dataset generator downloaded and prepared to /root/.cache/huggingface/datasets/generator/default-b6f2d3b37c5ee650/0.0.0. Subsequent calls will reuse this data.
2024-10-10 13:12:03 - WARNING - accelerate.utils.other - Detected kernel version 5.4.119, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Generating train split: 0 examples [00:00, ?
examples/s][INFO|trainer.py:648] 2024-10-10 13:12:04,223 >> Using auto half precision backend 2024-10-10 13:12:04 - INFO - __main__ - *** Train *** [2024-10-10 13:12:05,260] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.15.1, git-hash=unknown, git-branch=unknown [2024-10-10 13:12:05,260] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8 [2024-10-10 13:12:05,265] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False [2024-10-10 13:12:05,266] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer [2024-10-10 13:12:05,266] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2024-10-10 13:12:05,268] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW [2024-10-10 13:12:05,268] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type= [2024-10-10 13:12:05,268] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False [2024-10-10 13:12:05,268] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer [2024-10-10 13:12:05,446] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning [2024-10-10 13:12:05,447] [INFO] [utils.py:782:see_memory_usage] MA 0.29 GB Max_MA 1.27 GB CA 0.59 GB Max_CA 2 GB [2024-10-10 13:12:05,447] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 82.61 GB, percent = 4.4% [2024-10-10 13:12:05,448] [INFO] [stage3.py:164:__init__] Reduce bucket size 500000000 [2024-10-10 13:12:05,448] [INFO] [stage3.py:165:__init__] Prefetch bucket size 50000000 [2024-10-10 13:12:05,566] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin] [2024-10-10 13:12:05,567] [INFO] [utils.py:782:see_memory_usage] MA 0.29 GB Max_MA 0.29 GB CA 0.59 GB Max_CA 1 GB [2024-10-10 13:12:05,567] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 82.62 GB, percent = 4.4% Parameter Offload: Total persistent parameters: 67584 in 33 params [2024-10-10 13:12:05,690] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end] [2024-10-10 13:12:05,690] [INFO] [utils.py:782:see_memory_usage] MA 0.29 GB Max_MA 0.29 GB CA 0.59 GB Max_CA 1 GB [2024-10-10 13:12:05,690] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 82.63 GB, percent = 4.4% [2024-10-10 13:12:05,804] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions [2024-10-10 13:12:05,804] [INFO] [utils.py:782:see_memory_usage] MA 0.29 GB Max_MA 0.29 GB CA 0.59 GB Max_CA 1 GB [2024-10-10 13:12:05,805] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 82.66 GB, percent = 4.4% [WARNING|tokenization_utils_base.py:4101] 2024-10-10 13:12:05,843 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2871 > 2048). 
Running this sequence through the model will result in indexing errors
Generating train split: 13458 examples [00:48, 278.51 examples/s] (intermediate progress updates omitted)
[2024-10-10 13:12:53,747] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 2
[2024-10-10 13:12:53,748] [INFO] [utils.py:782:see_memory_usage] MA 0.29 GB Max_MA 0.29 GB CA 0.29 GB Max_CA 1 GB
[2024-10-10 13:12:53,748] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 82.83 GB, percent = 4.4%
[2024-10-10 13:12:53,858] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
[2024-10-10 13:12:53,858] [INFO] [utils.py:782:see_memory_usage] MA 0.29 GB Max_MA 0.29 GB CA 0.29 GB Max_CA 0 GB
[2024-10-10 13:12:53,859] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 82.83 GB, percent = 4.4%
[2024-10-10 13:12:53,967] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
[2024-10-10 13:12:53,968] [INFO] [utils.py:782:see_memory_usage] MA 0.86 GB Max_MA 1.15 GB CA 1.16 GB Max_CA 1 GB
[2024-10-10 13:12:53,968] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 82.83 GB, percent = 4.4%
[2024-10-10 13:12:54,077] [INFO] [utils.py:781:see_memory_usage] Before initializing
optimizer states [2024-10-10 13:12:54,078] [INFO] [utils.py:782:see_memory_usage] MA 0.86 GB Max_MA 0.86 GB CA 1.16 GB Max_CA 1 GB [2024-10-10 13:12:54,078] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 82.83 GB, percent = 4.4% [2024-10-10 13:12:54,240] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2024-10-10 13:12:54,241] [INFO] [utils.py:782:see_memory_usage] MA 0.86 GB Max_MA 1.44 GB CA 1.73 GB Max_CA 2 GB [2024-10-10 13:12:54,241] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 82.83 GB, percent = 4.4% [2024-10-10 13:12:54,242] [INFO] [stage3.py:517:_setup_for_real_optimizer] optimizer state initialized [2024-10-10 13:12:54,472] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2024-10-10 13:12:54,473] [INFO] [utils.py:782:see_memory_usage] MA 2.08 GB Max_MA 3.06 GB CA 3.15 GB Max_CA 3 GB [2024-10-10 13:12:54,473] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 82.94 GB, percent = 4.4% [2024-10-10 13:12:54,473] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3 [2024-10-10 13:12:54,473] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = None [2024-10-10 13:12:54,473] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None [2024-10-10 13:12:54,473] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)] [2024-10-10 13:12:54,474] [INFO] [config.py:999:print] DeepSpeedEngine configuration: [2024-10-10 13:12:54,474] [INFO] [config.py:1003:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2024-10-10 13:12:54,474] [INFO] [config.py:1003:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False} [2024-10-10 13:12:54,474] [INFO] [config.py:1003:print] amp_enabled .................. False [2024-10-10 13:12:54,474] [INFO] [config.py:1003:print] amp_params ................... False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] bfloat16_enabled ............. True [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] bfloat16_immediate_grad_update False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] checkpoint_parallel_write_pipeline False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] checkpoint_tag_validation_enabled True [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] checkpoint_tag_validation_fail False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] comms_config ................. 
[2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] communication_data_type ...... None [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] dataloader_drop_last ......... False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] disable_allgather ............ False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] dump_state ................... False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] eigenvalue_enabled ........... False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1 [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0 [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100 [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06 [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] elasticity_enabled ........... False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] fp16_auto_cast ............... None [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] fp16_enabled ................. 
False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] global_rank .................. 0 [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] grad_accum_dtype ............. None [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 1 [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0 [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] graph_harvesting ............. False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1 [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] load_universal_checkpoint .... False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] loss_scale ................... 1.0 [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] memory_breakdown ............. False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] mics_shard_size .............. -1 [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] optimizer_name ............... None [2024-10-10 13:12:54,475] [INFO] [config.py:1003:print] optimizer_params ............. None [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] pld_enabled .................. False [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] pld_params ................... False [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] prescale_gradients ........... False [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] scheduler_name ............... None [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] scheduler_params ............. None [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32 [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] sparse_attention ............. None [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] steps_per_print .............. 
inf [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] train_batch_size ............. 128 [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 16 [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] use_node_local_storage ....... False [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] wall_clock_breakdown ......... False [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] weight_quantization_config ... None [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] world_size ................... 8 [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] zero_allow_untested_optimizer True [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] zero_enabled ................. True [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True [2024-10-10 13:12:54,476] [INFO] [config.py:1003:print] zero_optimization_stage ...... 3 [2024-10-10 13:12:54,476] [INFO] [config.py:989:print_user_config] json = { "train_batch_size": 128, "train_micro_batch_size_per_gpu": 16, "gradient_accumulation_steps": 1, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "none", "nvme_path": null }, "offload_param": { "device": "none", "nvme_path": null }, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_clipping": 1.0, "steps_per_print": inf, "bf16": { "enabled": true }, "fp16": { "enabled": false }, "zero_allow_untested_optimizer": true } [INFO|trainer.py:2134] 2024-10-10 13:12:54,476 >> ***** Running training ***** [INFO|trainer.py:2135] 2024-10-10 13:12:54,476 >> Num examples = 121,635 [INFO|trainer.py:2136] 2024-10-10 13:12:54,476 >> Num Epochs = 1 [INFO|trainer.py:2137] 2024-10-10 13:12:54,476 >> Instantaneous batch size per device = 16 [INFO|trainer.py:2140] 2024-10-10 13:12:54,476 >> Total train batch size (w. 
parallel, distributed & accumulation) = 128 [INFO|trainer.py:2141] 2024-10-10 13:12:54,476 >> Gradient Accumulation steps = 1 [INFO|trainer.py:2142] 2024-10-10 13:12:54,476 >> Total optimization steps = 951 [INFO|trainer.py:2143] 2024-10-10 13:12:54,477 >> Number of trainable parameters = 1,235,814,400 0%| | 0/951 [00:00> Saving model checkpoint to /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-100 [INFO|configuration_utils.py:472] 2024-10-10 13:14:44,050 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-100/config.json [INFO|configuration_utils.py:807] 2024-10-10 13:14:44,053 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-100/generation_config.json [INFO|modeling_utils.py:2799] 2024-10-10 13:14:46,193 >> Model weights saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-100/model.safetensors [INFO|tokenization_utils_base.py:2684] 2024-10-10 13:14:46,199 >> tokenizer config file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-100/tokenizer_config.json [INFO|tokenization_utils_base.py:2693] 2024-10-10 13:14:46,202 >> Special tokens file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-100/special_tokens_map.json [2024-10-10 13:14:46,355] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step100 is about to be saved! [2024-10-10 13:14:46,363] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-100/global_step100/zero_pp_rank_0_mp_rank_00_model_states.pt [2024-10-10 13:14:46,363] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-100/global_step100/zero_pp_rank_0_mp_rank_00_model_states.pt... [2024-10-10 13:14:46,381] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-100/global_step100/zero_pp_rank_0_mp_rank_00_model_states.pt. [2024-10-10 13:14:46,390] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-100/global_step100/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... [2024-10-10 13:14:49,347] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-100/global_step100/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. [2024-10-10 13:14:49,363] [INFO] [engine.py:3536:_save_zero_checkpoint] zero checkpoint saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-100/global_step100/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [2024-10-10 13:14:49,998] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step100 is ready now! /workspace/jihaozhe/miniconda3/envs/handbook/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. 
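A quick sanity check on the run banner above (all values are taken from the log itself; the snippet is only illustrative): the effective batch size and the step count follow directly from the printed configuration.

import math

# Values printed in the DeepSpeed config and the trainer banner above.
micro_batch_per_gpu = 16      # train_micro_batch_size_per_gpu
world_size = 8                # world_size
grad_accum_steps = 1          # gradient_accumulation_steps
num_train_examples = 121_635  # "Num examples"

effective_batch = micro_batch_per_gpu * world_size * grad_accum_steps
steps_per_epoch = math.ceil(num_train_examples / effective_batch)

print(effective_batch)   # 128, matching "train_batch_size"
print(steps_per_epoch)   # 951, matching "Total optimization steps" for 1 epoch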
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined] 11%|█ | 101/951 [01:56<43:15, 3.05s/it] 11%|█ | 102/951 [01:57<34:31, 2.44s/it] 11%|█ | 103/951 [01:58<28:26, 2.01s/it] 11%|█ | 104/951 [01:59<24:09, 1.71s/it] 11%|█ | 105/951 [02:00<21:09, 1.50s/it] {'loss': 1.321, 'grad_norm': 0.8322397994562853, 'learning_rate': 1.999453257340926e-05, 'epoch': 0.11} 11%|█ | 105/951 [02:00<21:09, 1.50s/it] 11%|█ | 106/951 [02:01<19:06, 1.36s/it] 11%|█▏ | 107/951 [02:02<17:36, 1.25s/it] 11%|█▏ | 108/951 [02:03<16:35, 1.18s/it] 11%|█▏ | 109/951 [02:04<15:50, 1.13s/it] 12%|█▏ | 110/951 [02:05<15:19, 1.09s/it] {'loss': 1.3145, 'grad_norm': 0.8169700972120059, 'learning_rate': 1.9986771889316172e-05, 'epoch': 0.12} 12%|█▏ | 110/951 [02:05<15:19, 1.09s/it] 12%|█▏ | 111/951 [02:06<15:56, 1.14s/it] 12%|█▏ | 112/951 [02:07<15:23, 1.10s/it] 12%|█▏ | 113/951 [02:09<15:38, 1.12s/it] 12%|█▏ | 114/951 [02:10<15:09, 1.09s/it] 12%|█▏ | 115/951 [02:11<14:49, 1.06s/it] {'loss': 1.3537, 'grad_norm': 0.8054378165910604, 'learning_rate': 1.9975640502598243e-05, 'epoch': 0.12} 12%|█▏ | 115/951 [02:11<14:49, 1.06s/it] 12%|█▏ | 116/951 [02:12<14:37, 1.05s/it] 12%|█▏ | 117/951 [02:13<14:25, 1.04s/it] 12%|█▏ | 118/951 [02:14<14:18, 1.03s/it] 13%|█▎ | 119/951 [02:15<14:11, 1.02s/it] 13%|█▎ | 120/951 [02:16<14:07, 1.02s/it] {'loss': 1.3081, 'grad_norm': 0.8851185792466449, 'learning_rate': 1.9961142170284762e-05, 'epoch': 0.13} 13%|█▎ | 120/951 [02:16<14:07, 1.02s/it] 13%|█▎ | 121/951 [02:17<14:06, 1.02s/it] 13%|█▎ | 122/951 [02:18<14:02, 1.02s/it] 13%|█▎ | 123/951 [02:19<14:00, 1.02s/it] 13%|█▎ | 124/951 [02:20<13:58, 1.01s/it] 13%|█▎ | 125/951 [02:21<13:57, 1.01s/it] {'loss': 1.3235, 'grad_norm': 0.7983608745240434, 'learning_rate': 1.9943281785805483e-05, 'epoch': 0.13} 13%|█▎ | 125/951 [02:21<13:57, 1.01s/it] 13%|█▎ | 126/951 [02:22<13:57, 1.01s/it] 13%|█▎ | 127/951 [02:23<13:55, 1.01s/it] 13%|█▎ | 128/951 [02:24<13:53, 1.01s/it] 14%|█▎ | 129/951 [02:25<13:52, 1.01s/it] 14%|█▎ | 130/951 [02:26<13:50, 1.01s/it] {'loss': 1.3234, 'grad_norm': 0.8101924460502825, 'learning_rate': 1.9922065377339037e-05, 'epoch': 0.14} 14%|█▎ | 130/951 [02:26<13:50, 1.01s/it] 14%|█▍ | 131/951 [02:27<13:50, 1.01s/it] 14%|█▍ | 132/951 [02:28<13:49, 1.01s/it] 14%|█▍ | 133/951 [02:29<13:47, 1.01s/it] 14%|█▍ | 134/951 [02:30<13:47, 1.01s/it] 14%|█▍ | 135/951 [02:31<13:45, 1.01s/it] {'loss': 1.3275, 'grad_norm': 0.8181918475547033, 'learning_rate': 1.98975001057783e-05, 'epoch': 0.14} 14%|█▍ | 135/951 [02:31<13:45, 1.01s/it] 14%|█▍ | 136/951 [02:32<13:45, 1.01s/it] 14%|█▍ | 137/951 [02:33<13:45, 1.01s/it] 15%|█▍ | 138/951 [02:34<13:42, 1.01s/it] 15%|█▍ | 139/951 [02:35<13:42, 1.01s/it] 15%|█▍ | 140/951 [02:36<13:41, 1.01s/it] {'loss': 1.3191, 'grad_norm': 0.8204094295073951, 'learning_rate': 1.986959426231349e-05, 'epoch': 0.15} 15%|█▍ | 140/951 [02:36<13:41, 1.01s/it] 15%|█▍ | 141/951 [02:37<13:40, 1.01s/it] 15%|█▍ | 142/951 [02:38<13:39, 1.01s/it] 15%|█▌ | 143/951 [02:39<13:37, 1.01s/it] 15%|█▌ | 144/951 [02:40<13:36, 1.01s/it] 15%|█▌ | 145/951 [02:41<13:35, 1.01s/it] {'loss': 1.3151, 'grad_norm': 0.8193214439419285, 'learning_rate': 1.983835726563373e-05, 'epoch': 0.15} 15%|█▌ | 145/951 [02:41<13:35, 1.01s/it] 15%|█▌ | 146/951 [02:42<13:35, 1.01s/it] 15%|█▌ | 147/951 [02:43<13:33, 1.01s/it] 16%|█▌ | 148/951 [02:44<13:32, 1.01s/it] 16%|█▌ | 149/951 [02:45<13:31, 1.01s/it] 16%|█▌ | 150/951 [02:46<13:31, 1.01s/it] {'loss': 1.3173, 'grad_norm': 0.8008496299830612, 'learning_rate': 
1.9803799658748096e-05, 'epoch': 0.16} 16%|█▌ | 150/951 [02:46<13:31, 1.01s/it] 16%|█▌ | 151/951 [02:47<13:30, 1.01s/it] 16%|█▌ | 152/951 [02:48<13:28, 1.01s/it] 16%|█▌ | 153/951 [02:49<13:29, 1.01s/it] 16%|█▌ | 154/951 [02:50<13:33, 1.02s/it] 16%|█▋ | 155/951 [02:51<13:30, 1.02s/it] {'loss': 1.3193, 'grad_norm': 0.8976369540265376, 'learning_rate': 1.976593310542718e-05, 'epoch': 0.16} 16%|█▋ | 155/951 [02:51<13:30, 1.02s/it] 16%|█▋ | 156/951 [02:52<13:29, 1.02s/it] 17%|█▋ | 157/951 [02:53<13:26, 1.02s/it] 17%|█▋ | 158/951 [02:54<13:24, 1.01s/it] 17%|█▋ | 159/951 [02:55<13:22, 1.01s/it] 17%|█▋ | 160/951 [02:56<13:21, 1.01s/it] {'loss': 1.3074, 'grad_norm': 0.8219320759522418, 'learning_rate': 1.9724770386266363e-05, 'epoch': 0.17} 17%|█▋ | 160/951 [02:56<13:21, 1.01s/it] 17%|█▋ | 161/951 [02:57<13:20, 1.01s/it] 17%|█▋ | 162/951 [02:58<13:19, 1.01s/it] 17%|█▋ | 163/951 [02:59<13:17, 1.01s/it] 17%|█▋ | 164/951 [03:00<13:16, 1.01s/it] 17%|█▋ | 165/951 [03:01<13:14, 1.01s/it] {'loss': 1.3229, 'grad_norm': 0.7917666416050304, 'learning_rate': 1.968032539437215e-05, 'epoch': 0.17} 17%|█▋ | 165/951 [03:01<13:14, 1.01s/it] 17%|█▋ | 166/951 [03:02<13:14, 1.01s/it] 18%|█▊ | 167/951 [03:03<13:13, 1.01s/it] 18%|█▊ | 168/951 [03:04<13:12, 1.01s/it] 18%|█▊ | 169/951 [03:05<13:12, 1.01s/it] 18%|█▊ | 170/951 [03:06<13:11, 1.01s/it] {'loss': 1.3053, 'grad_norm': 0.8468899498707804, 'learning_rate': 1.963261313067302e-05, 'epoch': 0.18} 18%|█▊ | 170/951 [03:06<13:11, 1.01s/it] 18%|█▊ | 171/951 [03:07<13:11, 1.01s/it] 18%|█▊ | 172/951 [03:08<13:08, 1.01s/it] 18%|█▊ | 173/951 [03:09<13:07, 1.01s/it] 18%|█▊ | 174/951 [03:11<14:02, 1.08s/it] 18%|█▊ | 175/951 [03:12<13:44, 1.06s/it] {'loss': 1.2994, 'grad_norm': 0.7826255604907222, 'learning_rate': 1.958164969885636e-05, 'epoch': 0.18} 18%|█▊ | 175/951 [03:12<13:44, 1.06s/it] 19%|█▊ | 176/951 [03:13<13:33, 1.05s/it] 19%|█▊ | 177/951 [03:14<13:23, 1.04s/it] 19%|█▊ | 178/951 [03:15<13:16, 1.03s/it] 19%|█▉ | 179/951 [03:16<13:10, 1.02s/it] 19%|█▉ | 180/951 [03:17<13:07, 1.02s/it] {'loss': 1.2933, 'grad_norm': 0.7854651229191463, 'learning_rate': 1.9527452299933192e-05, 'epoch': 0.19} 19%|█▉ | 180/951 [03:17<13:07, 1.02s/it] 19%|█▉ | 181/951 [03:18<13:05, 1.02s/it] 19%|█▉ | 182/951 [03:19<13:04, 1.02s/it] 19%|█▉ | 183/951 [03:20<13:01, 1.02s/it] 19%|█▉ | 184/951 [03:21<12:59, 1.02s/it] 19%|█▉ | 185/951 [03:22<12:57, 1.01s/it] {'loss': 1.3053, 'grad_norm': 0.8335518596689603, 'learning_rate': 1.9470039226432562e-05, 'epoch': 0.19} 19%|█▉ | 185/951 [03:22<12:57, 1.01s/it] 20%|█▉ | 186/951 [03:23<12:56, 1.01s/it] 20%|█▉ | 187/951 [03:24<12:54, 1.01s/it] 20%|█▉ | 188/951 [03:25<12:53, 1.01s/it] 20%|█▉ | 189/951 [03:26<12:52, 1.01s/it] 20%|█▉ | 190/951 [03:27<12:51, 1.01s/it] {'loss': 1.3118, 'grad_norm': 0.7925970741738505, 'learning_rate': 1.9409429856227487e-05, 'epoch': 0.2} 20%|█▉ | 190/951 [03:27<12:51, 1.01s/it] 20%|██ | 191/951 [03:28<12:51, 1.01s/it] 20%|██ | 192/951 [03:29<12:49, 1.01s/it] 20%|██ | 193/951 [03:30<12:47, 1.01s/it] 20%|██ | 194/951 [03:31<12:45, 1.01s/it] 21%|██ | 195/951 [03:32<12:44, 1.01s/it] {'loss': 1.3006, 'grad_norm': 0.9022667633817576, 'learning_rate': 1.934564464599461e-05, 'epoch': 0.21} 21%|██ | 195/951 [03:32<12:44, 1.01s/it] 21%|██ | 196/951 [03:33<12:44, 1.01s/it] 21%|██ | 197/951 [03:34<12:44, 1.01s/it] 21%|██ | 198/951 [03:35<12:44, 1.02s/it] 21%|██ | 199/951 [03:36<12:49, 1.02s/it] 21%|██ | 200/951 [03:37<12:46, 1.02s/it] {'loss': 1.3019, 'grad_norm': 0.7846381821366918, 'learning_rate': 1.9278705124309724e-05, 'epoch': 0.21} 
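The FutureWarning that appears after each checkpoint save comes from torch's own activation-checkpointing code rather than the training script, so it is harmless for this run; the message itself names the non-deprecated spelling of the same context manager. A minimal standalone illustration of the two forms (hypothetical snippet, not taken from this run):

import torch

x = torch.randn(4, 4)

# Deprecated form that triggers the FutureWarning seen in the log:
with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    y_old = x @ x

# Replacement suggested by the warning message:
with torch.amp.autocast("cpu", dtype=torch.bfloat16):
    y_new = x @ x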
21%|██ | 200/951 [03:37<12:46, 1.02s/it][INFO|trainer.py:3503] 2024-10-10 13:16:32,779 >> Saving model checkpoint to /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-200 [INFO|configuration_utils.py:472] 2024-10-10 13:16:32,790 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-200/config.json [INFO|configuration_utils.py:807] 2024-10-10 13:16:32,794 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-200/generation_config.json [INFO|modeling_utils.py:2799] 2024-10-10 13:16:34,996 >> Model weights saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-200/model.safetensors [INFO|tokenization_utils_base.py:2684] 2024-10-10 13:16:35,002 >> tokenizer config file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-200/tokenizer_config.json [INFO|tokenization_utils_base.py:2693] 2024-10-10 13:16:35,005 >> Special tokens file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-200/special_tokens_map.json [2024-10-10 13:16:35,139] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step200 is about to be saved! [2024-10-10 13:16:35,147] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-200/global_step200/zero_pp_rank_0_mp_rank_00_model_states.pt [2024-10-10 13:16:35,147] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-200/global_step200/zero_pp_rank_0_mp_rank_00_model_states.pt... [2024-10-10 13:16:35,167] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-200/global_step200/zero_pp_rank_0_mp_rank_00_model_states.pt. [2024-10-10 13:16:35,177] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-200/global_step200/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... [2024-10-10 13:16:38,388] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-200/global_step200/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. [2024-10-10 13:16:38,394] [INFO] [engine.py:3536:_save_zero_checkpoint] zero checkpoint saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-200/global_step200/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [2024-10-10 13:16:38,717] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step200 is ready now! [INFO|trainer.py:3595] 2024-10-10 13:16:38,734 >> Deleting older checkpoint [/workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-100] due to args.save_total_limit /workspace/jihaozhe/miniconda3/envs/handbook/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. 
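Because the ZeRO config above has stage3_gather_16bit_weights_on_model_save enabled, each checkpoint-N directory already holds consolidated bf16 weights in model.safetensors, so it can be opened like any regular Hugging Face model directory. A minimal sketch (the path is copied from the save messages above; loading it this way is an assumption about later use, not something the run itself does):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint directory taken from the log above.
ckpt_dir = "/workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-200"

tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
model = AutoModelForCausalLM.from_pretrained(ckpt_dir, torch_dtype=torch.bfloat16)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))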
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined] 21%|██ | 201/951 [03:47<46:00, 3.68s/it] 21%|██ | 202/951 [03:48<35:56, 2.88s/it] 21%|██▏ | 203/951 [03:49<28:54, 2.32s/it] 21%|██▏ | 204/951 [03:50<23:59, 1.93s/it] 22%|██▏ | 205/951 [03:51<20:31, 1.65s/it] {'loss': 1.3096, 'grad_norm': 0.8523088106258838, 'learning_rate': 1.9208633884381528e-05, 'epoch': 0.22} 22%|██▏ | 205/951 [03:51<20:31, 1.65s/it] 22%|██▏ | 206/951 [03:52<18:08, 1.46s/it] 22%|██▏ | 207/951 [03:53<16:27, 1.33s/it] 22%|██▏ | 208/951 [03:54<15:14, 1.23s/it] 22%|██▏ | 209/951 [03:55<14:24, 1.16s/it] 22%|██▏ | 210/951 [03:56<13:49, 1.12s/it] {'loss': 1.3292, 'grad_norm': 0.7773410227959319, 'learning_rate': 1.913545457642601e-05, 'epoch': 0.22} 22%|██▏ | 210/951 [03:56<13:49, 1.12s/it] 22%|██▏ | 211/951 [03:57<13:25, 1.09s/it] 22%|██▏ | 212/951 [03:58<13:08, 1.07s/it] 22%|██▏ | 213/951 [03:59<12:54, 1.05s/it] 23%|██▎ | 214/951 [04:00<12:46, 1.04s/it] 23%|██▎ | 215/951 [04:01<12:38, 1.03s/it] {'loss': 1.3039, 'grad_norm': 0.7700372006553495, 'learning_rate': 1.9059191899684154e-05, 'epoch': 0.23} 23%|██▎ | 215/951 [04:01<12:38, 1.03s/it] 23%|██▎ | 216/951 [04:02<12:34, 1.03s/it] 23%|██▎ | 217/951 [04:03<12:30, 1.02s/it] 23%|██▎ | 218/951 [04:04<12:27, 1.02s/it] 23%|██▎ | 219/951 [04:05<12:24, 1.02s/it] 23%|██▎ | 220/951 [04:06<12:22, 1.02s/it] {'loss': 1.2877, 'grad_norm': 0.888501634972992, 'learning_rate': 1.8979871594085482e-05, 'epoch': 0.23} 23%|██▎ | 220/951 [04:06<12:22, 1.02s/it] 23%|██▎ | 221/951 [04:07<12:22, 1.02s/it] 23%|██▎ | 222/951 [04:08<12:20, 1.02s/it] 23%|██▎ | 223/951 [04:09<12:18, 1.01s/it] 24%|██▎ | 224/951 [04:10<12:17, 1.01s/it] 24%|██▎ | 225/951 [04:11<12:15, 1.01s/it] {'loss': 1.3015, 'grad_norm': 1.1002462859379447, 'learning_rate': 1.8897520431560435e-05, 'epoch': 0.24} 24%|██▎ | 225/951 [04:11<12:15, 1.01s/it] 24%|██▍ | 226/951 [04:12<12:15, 1.01s/it] 24%|██▍ | 227/951 [04:13<12:14, 1.01s/it] 24%|██▍ | 228/951 [04:14<12:13, 1.01s/it] 24%|██▍ | 229/951 [04:15<12:11, 1.01s/it] 24%|██▍ | 230/951 [04:16<12:10, 1.01s/it] {'loss': 1.3115, 'grad_norm': 0.8737204056440837, 'learning_rate': 1.881216620700437e-05, 'epoch': 0.24} 24%|██▍ | 230/951 [04:16<12:10, 1.01s/it] 24%|██▍ | 231/951 [04:17<12:09, 1.01s/it] 24%|██▍ | 232/951 [04:18<12:17, 1.03s/it] 25%|██▍ | 233/951 [04:19<12:13, 1.02s/it] 25%|██▍ | 234/951 [04:20<12:10, 1.02s/it] 25%|██▍ | 235/951 [04:21<12:07, 1.02s/it] {'loss': 1.3046, 'grad_norm': 0.789475981373434, 'learning_rate': 1.872383772889634e-05, 'epoch': 0.25} 25%|██▍ | 235/951 [04:21<12:07, 1.02s/it] 25%|██▍ | 236/951 [04:22<12:07, 1.02s/it] 25%|██▍ | 237/951 [04:24<12:39, 1.06s/it] 25%|██▌ | 238/951 [04:25<12:27, 1.05s/it] 25%|██▌ | 239/951 [04:26<12:19, 1.04s/it] 25%|██▌ | 240/951 [04:27<12:12, 1.03s/it] {'loss': 1.314, 'grad_norm': 0.7857837110627846, 'learning_rate': 1.863256480957574e-05, 'epoch': 0.25} 25%|██▌ | 240/951 [04:27<12:12, 1.03s/it] 25%|██▌ | 241/951 [04:28<12:09, 1.03s/it] 25%|██▌ | 242/951 [04:29<12:04, 1.02s/it] 26%|██▌ | 243/951 [04:30<12:01, 1.02s/it] 26%|██▌ | 244/951 [04:31<11:59, 1.02s/it] 26%|██▌ | 245/951 [04:32<11:57, 1.02s/it] {'loss': 1.2965, 'grad_norm': 0.8735939515101088, 'learning_rate': 1.853837825518014e-05, 'epoch': 0.26} 26%|██▌ | 245/951 [04:32<11:57, 1.02s/it] 26%|██▌ | 246/951 [04:33<11:56, 1.02s/it] 26%|██▌ | 247/951 [04:34<11:54, 1.01s/it] 26%|██▌ | 248/951 [04:35<11:53, 1.01s/it] 26%|██▌ | 249/951 [04:36<11:52, 1.01s/it] 26%|██▋ | 250/951 [04:37<11:49, 1.01s/it] {'loss': 
1.2847, 'grad_norm': 0.8274285162573444, 'learning_rate': 1.844130985524771e-05, 'epoch': 0.26} 26%|██▋ | 250/951 [04:37<11:49, 1.01s/it] 26%|██▋ | 251/951 [04:38<11:49, 1.01s/it] 26%|██▋ | 252/951 [04:39<11:48, 1.01s/it] 27%|██▋ | 253/951 [04:40<11:46, 1.01s/it] 27%|██▋ | 254/951 [04:41<11:46, 1.01s/it] 27%|██▋ | 255/951 [04:42<11:44, 1.01s/it] {'loss': 1.3033, 'grad_norm': 0.9655358070442068, 'learning_rate': 1.83413923719877e-05, 'epoch': 0.27} 27%|██▋ | 255/951 [04:42<11:44, 1.01s/it] 27%|██▋ | 256/951 [04:43<11:44, 1.01s/it] 27%|██▋ | 257/951 [04:44<11:43, 1.01s/it] 27%|██▋ | 258/951 [04:45<11:42, 1.01s/it] 27%|██▋ | 259/951 [04:46<11:40, 1.01s/it] 27%|██▋ | 260/951 [04:47<11:40, 1.01s/it] {'loss': 1.2964, 'grad_norm': 0.784933865330852, 'learning_rate': 1.8238659529222672e-05, 'epoch': 0.27} 27%|██▋ | 260/951 [04:47<11:40, 1.01s/it] 27%|██▋ | 261/951 [04:48<11:39, 1.01s/it] 28%|██▊ | 262/951 [04:49<11:37, 1.01s/it] 28%|██▊ | 263/951 [04:50<11:37, 1.01s/it] 28%|██▊ | 264/951 [04:51<11:36, 1.01s/it] 28%|██▊ | 265/951 [04:52<11:34, 1.01s/it] {'loss': 1.3, 'grad_norm': 0.8072089928871935, 'learning_rate': 1.813314600100612e-05, 'epoch': 0.28} 28%|██▊ | 265/951 [04:52<11:34, 1.01s/it] 28%|██▊ | 266/951 [04:53<11:34, 1.01s/it] 28%|██▊ | 267/951 [04:54<11:32, 1.01s/it] 28%|██▊ | 268/951 [04:55<11:31, 1.01s/it] 28%|██▊ | 269/951 [04:56<11:31, 1.01s/it] 28%|██▊ | 270/951 [04:57<11:30, 1.01s/it] {'loss': 1.2897, 'grad_norm': 0.7600500534210821, 'learning_rate': 1.802488739991941e-05, 'epoch': 0.28} 28%|██▊ | 270/951 [04:57<11:30, 1.01s/it] 28%|██▊ | 271/951 [04:58<11:30, 1.02s/it] 29%|██▊ | 272/951 [04:59<11:29, 1.02s/it] 29%|██▊ | 273/951 [05:00<11:28, 1.01s/it] 29%|██▉ | 274/951 [05:01<11:25, 1.01s/it] 29%|██▉ | 275/951 [05:02<11:24, 1.01s/it] {'loss': 1.2994, 'grad_norm': 0.7716489846603405, 'learning_rate': 1.7913920265051947e-05, 'epoch': 0.29} 29%|██▉ | 275/951 [05:02<11:24, 1.01s/it] 29%|██▉ | 276/951 [05:03<11:24, 1.01s/it] 29%|██▉ | 277/951 [05:04<11:23, 1.01s/it] 29%|██▉ | 278/951 [05:05<11:22, 1.01s/it] 29%|██▉ | 279/951 [05:06<11:21, 1.01s/it] 29%|██▉ | 280/951 [05:07<11:20, 1.01s/it] {'loss': 1.3146, 'grad_norm': 0.8439416659266935, 'learning_rate': 1.7800282049668593e-05, 'epoch': 0.29} 29%|██▉ | 280/951 [05:07<11:20, 1.01s/it] 30%|██▉ | 281/951 [05:08<11:20, 1.02s/it] 30%|██▉ | 282/951 [05:09<11:18, 1.01s/it] 30%|██▉ | 283/951 [05:10<11:17, 1.01s/it] 30%|██▉ | 284/951 [05:11<11:15, 1.01s/it] 30%|██▉ | 285/951 [05:12<11:14, 1.01s/it] {'loss': 1.3157, 'grad_norm': 0.7413366975663128, 'learning_rate': 1.7684011108568593e-05, 'epoch': 0.3} 30%|██▉ | 285/951 [05:12<11:14, 1.01s/it] 30%|███ | 286/951 [05:13<11:14, 1.01s/it] 30%|███ | 287/951 [05:14<11:13, 1.01s/it] 30%|███ | 288/951 [05:15<11:11, 1.01s/it] 30%|███ | 289/951 [05:16<11:10, 1.01s/it] 30%|███ | 290/951 [05:17<11:09, 1.01s/it] {'loss': 1.2944, 'grad_norm': 0.9324094146103826, 'learning_rate': 1.7565146685140168e-05, 'epoch': 0.3} 30%|███ | 290/951 [05:17<11:09, 1.01s/it] 31%|███ | 291/951 [05:18<11:09, 1.01s/it] 31%|███ | 292/951 [05:19<11:07, 1.01s/it] 31%|███ | 293/951 [05:20<11:07, 1.01s/it] 31%|███ | 294/951 [05:21<11:06, 1.01s/it] 31%|███ | 295/951 [05:22<11:37, 1.06s/it] {'loss': 1.3041, 'grad_norm': 0.7497525070350954, 'learning_rate': 1.7443728898115228e-05, 'epoch': 0.31} 31%|███ | 295/951 [05:22<11:37, 1.06s/it] 31%|███ | 296/951 [05:23<11:27, 1.05s/it] 31%|███ | 297/951 [05:25<12:09, 1.12s/it] 31%|███▏ | 298/951 [05:26<11:47, 1.08s/it] 31%|███▏ | 299/951 [05:27<11:33, 1.06s/it] 32%|███▏ | 300/951 [05:28<11:21, 
1.05s/it] {'loss': 1.2855, 'grad_norm': 0.7274267632257158, 'learning_rate': 1.7319798728028617e-05, 'epoch': 0.32} 32%|███▏ | 300/951 [05:28<11:21, 1.05s/it][INFO|trainer.py:3503] 2024-10-10 13:18:23,559 >> Saving model checkpoint to /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-300 [INFO|configuration_utils.py:472] 2024-10-10 13:18:23,570 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-300/config.json [INFO|configuration_utils.py:807] 2024-10-10 13:18:23,573 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-300/generation_config.json [INFO|modeling_utils.py:2799] 2024-10-10 13:18:25,763 >> Model weights saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-300/model.safetensors [INFO|tokenization_utils_base.py:2684] 2024-10-10 13:18:25,769 >> tokenizer config file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-300/tokenizer_config.json [INFO|tokenization_utils_base.py:2693] 2024-10-10 13:18:25,772 >> Special tokens file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-300/special_tokens_map.json [2024-10-10 13:18:25,908] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step300 is about to be saved! [2024-10-10 13:18:25,915] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-300/global_step300/zero_pp_rank_0_mp_rank_00_model_states.pt [2024-10-10 13:18:25,915] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-300/global_step300/zero_pp_rank_0_mp_rank_00_model_states.pt... [2024-10-10 13:18:25,929] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-300/global_step300/zero_pp_rank_0_mp_rank_00_model_states.pt. [2024-10-10 13:18:25,939] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-300/global_step300/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... [2024-10-10 13:18:28,761] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-300/global_step300/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. [2024-10-10 13:18:28,767] [INFO] [engine.py:3536:_save_zero_checkpoint] zero checkpoint saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-300/global_step300/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [2024-10-10 13:18:29,330] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step300 is ready now! [INFO|trainer.py:3595] 2024-10-10 13:18:29,347 >> Deleting older checkpoint [/workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-200] due to args.save_total_limit /workspace/jihaozhe/miniconda3/envs/handbook/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. 
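The learning-rate values logged above (a 2e-5 peak decaying to about 1.73e-5 by step 300) are consistent with a standard cosine decay preceded by a short linear warmup. A rough reconstruction of that curve, assuming roughly 10% warmup (96 of the 951 steps); the warmup setting itself is not visible in this excerpt, so treat it as an assumption:

import math

peak_lr = 2e-5
total_steps = 951
warmup_steps = 96  # assumed ~10% warmup; not shown in this log excerpt

def lr_at(step: int) -> float:
    # Linear warmup followed by cosine decay to zero.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(105))  # ~1.9995e-05, close to the value logged at step 105
print(lr_at(300))  # ~1.7320e-05, close to the value logged at step 300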
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined] 32%|███▏ | 301/951 [05:38<40:30, 3.74s/it] 32%|███▏ | 302/951 [05:39<31:35, 2.92s/it] 32%|███▏ | 303/951 [05:40<25:21, 2.35s/it] 32%|███▏ | 304/951 [05:41<20:59, 1.95s/it] 32%|███▏ | 305/951 [05:42<17:56, 1.67s/it] {'loss': 1.2967, 'grad_norm': 0.7306525895599777, 'learning_rate': 1.7193398003386514e-05, 'epoch': 0.32} 32%|███▏ | 305/951 [05:42<17:56, 1.67s/it] 32%|███▏ | 306/951 [05:43<15:49, 1.47s/it] 32%|███▏ | 307/951 [05:44<14:18, 1.33s/it] 32%|███▏ | 308/951 [05:45<13:15, 1.24s/it] 32%|███▏ | 309/951 [05:46<12:30, 1.17s/it] 33%|███▎ | 310/951 [05:47<11:59, 1.12s/it] {'loss': 1.3116, 'grad_norm': 0.7718504682998596, 'learning_rate': 1.7064569386548586e-05, 'epoch': 0.33} 33%|███▎ | 310/951 [05:47<11:59, 1.12s/it] 33%|███▎ | 311/951 [05:48<11:38, 1.09s/it] 33%|███▎ | 312/951 [05:49<11:22, 1.07s/it] 33%|███▎ | 313/951 [05:50<11:10, 1.05s/it] 33%|███▎ | 314/951 [05:51<11:02, 1.04s/it] 33%|███▎ | 315/951 [05:52<10:56, 1.03s/it] {'loss': 1.2812, 'grad_norm': 0.7002164572114049, 'learning_rate': 1.6933356359328756e-05, 'epoch': 0.33} 33%|███▎ | 315/951 [05:52<10:56, 1.03s/it] 33%|███▎ | 316/951 [05:53<10:51, 1.03s/it] 33%|███▎ | 317/951 [05:54<10:48, 1.02s/it] 33%|███▎ | 318/951 [05:55<10:44, 1.02s/it] 34%|███▎ | 319/951 [05:56<10:42, 1.02s/it] 34%|███▎ | 320/951 [05:57<10:40, 1.02s/it] {'loss': 1.2654, 'grad_norm': 0.757112323690967, 'learning_rate': 1.679980320831934e-05, 'epoch': 0.34} 34%|███▎ | 320/951 [05:57<10:40, 1.02s/it] 34%|███▍ | 321/951 [05:58<10:39, 1.02s/it] 34%|███▍ | 322/951 [05:59<10:38, 1.02s/it] 34%|███▍ | 323/951 [06:00<10:36, 1.01s/it] 34%|███▍ | 324/951 [06:01<10:35, 1.01s/it] 34%|███▍ | 325/951 [06:02<10:35, 1.01s/it] {'loss': 1.2755, 'grad_norm': 0.7509591301953, 'learning_rate': 1.6663955009943603e-05, 'epoch': 0.34} 34%|███▍ | 325/951 [06:02<10:35, 1.01s/it] 34%|███▍ | 326/951 [06:03<10:34, 1.02s/it] 34%|███▍ | 327/951 [06:04<10:33, 1.02s/it] 34%|███▍ | 328/951 [06:05<10:31, 1.01s/it] 35%|███▍ | 329/951 [06:06<10:30, 1.01s/it] 35%|███▍ | 330/951 [06:07<10:29, 1.01s/it] {'loss': 1.291, 'grad_norm': 0.7896423206563724, 'learning_rate': 1.6525857615241686e-05, 'epoch': 0.35} 35%|███▍ | 330/951 [06:07<10:29, 1.01s/it] 35%|███▍ | 331/951 [06:08<10:28, 1.01s/it] 35%|███▍ | 332/951 [06:09<10:27, 1.01s/it] 35%|███▌ | 333/951 [06:10<10:25, 1.01s/it] 35%|███▌ | 334/951 [06:11<10:24, 1.01s/it] 35%|███▌ | 335/951 [06:12<10:24, 1.01s/it] {'loss': 1.3, 'grad_norm': 0.8538357892614057, 'learning_rate': 1.6385557634395138e-05, 'epoch': 0.35} 35%|███▌ | 335/951 [06:12<10:24, 1.01s/it] 35%|███▌ | 336/951 [06:13<10:24, 1.02s/it] 35%|███▌ | 337/951 [06:14<10:22, 1.01s/it] 36%|███▌ | 338/951 [06:15<10:21, 1.01s/it] 36%|███▌ | 339/951 [06:16<10:20, 1.01s/it] 36%|███▌ | 340/951 [06:17<10:19, 1.01s/it] {'loss': 1.2825, 'grad_norm': 0.8257603179264333, 'learning_rate': 1.624310242099518e-05, 'epoch': 0.36} 36%|███▌ | 340/951 [06:17<10:19, 1.01s/it] 36%|███▌ | 341/951 [06:18<10:18, 1.01s/it] 36%|███▌ | 342/951 [06:19<10:18, 1.02s/it] 36%|███▌ | 343/951 [06:20<10:17, 1.02s/it] 36%|███▌ | 344/951 [06:21<10:16, 1.02s/it] 36%|███▋ | 345/951 [06:22<10:14, 1.01s/it] {'loss': 1.2903, 'grad_norm': 0.7506565139625698, 'learning_rate': 1.609854005606009e-05, 'epoch': 0.36} 36%|███▋ | 345/951 [06:22<10:14, 1.01s/it] 36%|███▋ | 346/951 [06:23<10:14, 1.02s/it] 36%|███▋ | 347/951 [06:24<10:12, 1.01s/it] 37%|███▋ | 348/951 [06:25<10:11, 1.01s/it] 37%|███▋ | 349/951 [06:26<10:09, 
1.01s/it] 37%|███▋ | 350/951 [06:27<10:09, 1.01s/it] {'loss': 1.32, 'grad_norm': 0.7962631813094616, 'learning_rate': 1.5951919331807052e-05, 'epoch': 0.37} 37%|███▋ | 350/951 [06:27<10:09, 1.01s/it] 37%|███▋ | 351/951 [06:28<10:08, 1.01s/it] 37%|███▋ | 352/951 [06:29<10:06, 1.01s/it] 37%|███▋ | 353/951 [06:31<10:06, 1.01s/it] 37%|███▋ | 354/951 [06:32<10:04, 1.01s/it] 37%|███▋ | 355/951 [06:33<10:29, 1.06s/it] {'loss': 1.3094, 'grad_norm': 0.7846004527594455, 'learning_rate': 1.5803289735183952e-05, 'epoch': 0.37} 37%|███▋ | 355/951 [06:33<10:29, 1.06s/it] 37%|███▋ | 356/951 [06:34<10:21, 1.04s/it] 38%|███▊ | 357/951 [06:35<11:00, 1.11s/it] 38%|███▊ | 358/951 [06:36<10:41, 1.08s/it] 38%|███▊ | 359/951 [06:37<10:28, 1.06s/it] 38%|███▊ | 360/951 [06:38<10:18, 1.05s/it] {'loss': 1.3097, 'grad_norm': 0.7529193912570731, 'learning_rate': 1.565270143116672e-05, 'epoch': 0.38} 38%|███▊ | 360/951 [06:38<10:18, 1.05s/it] 38%|███▊ | 361/951 [06:39<10:12, 1.04s/it] 38%|███▊ | 362/951 [06:40<10:06, 1.03s/it] 38%|███▊ | 363/951 [06:41<10:02, 1.03s/it] 38%|███▊ | 364/951 [06:42<09:58, 1.02s/it] 38%|███▊ | 365/951 [06:43<09:56, 1.02s/it] {'loss': 1.2954, 'grad_norm': 0.8450472817355228, 'learning_rate': 1.5500205245827814e-05, 'epoch': 0.38} 38%|███▊ | 365/951 [06:43<09:56, 1.02s/it] 38%|███▊ | 366/951 [06:44<09:55, 1.02s/it] 39%|███▊ | 367/951 [06:45<09:53, 1.02s/it] 39%|███▊ | 368/951 [06:46<09:51, 1.02s/it] 39%|███▉ | 369/951 [06:47<09:50, 1.01s/it] 39%|███▉ | 370/951 [06:48<09:48, 1.01s/it] {'loss': 1.2774, 'grad_norm': 0.7884988874909659, 'learning_rate': 1.5345852649181555e-05, 'epoch': 0.39} 39%|███▉ | 370/951 [06:48<09:48, 1.01s/it] 39%|███▉ | 371/951 [06:49<09:49, 1.02s/it] 39%|███▉ | 372/951 [06:50<09:47, 1.02s/it] 39%|███▉ | 373/951 [06:51<09:46, 1.02s/it] 39%|███▉ | 374/951 [06:52<09:45, 1.01s/it] 39%|███▉ | 375/951 [06:53<09:44, 1.01s/it] {'loss': 1.2788, 'grad_norm': 0.7181038933510216, 'learning_rate': 1.5189695737812153e-05, 'epoch': 0.39} 39%|███▉ | 375/951 [06:53<09:44, 1.01s/it] 40%|███▉ | 376/951 [06:54<09:44, 1.02s/it] 40%|███▉ | 377/951 [06:55<09:42, 1.02s/it] 40%|███▉ | 378/951 [06:56<09:42, 1.02s/it] 40%|███▉ | 379/951 [06:57<09:41, 1.02s/it] 40%|███▉ | 380/951 [06:58<09:40, 1.02s/it] {'loss': 1.2825, 'grad_norm': 0.7398759340722635, 'learning_rate': 1.503178721729022e-05, 'epoch': 0.4} 40%|███▉ | 380/951 [06:58<09:40, 1.02s/it] 40%|████ | 381/951 [06:59<09:39, 1.02s/it] 40%|████ | 382/951 [07:00<09:38, 1.02s/it] 40%|████ | 383/951 [07:01<09:36, 1.02s/it] 40%|████ | 384/951 [07:02<09:34, 1.01s/it] 40%|████ | 385/951 [07:03<09:33, 1.01s/it] {'loss': 1.2945, 'grad_norm': 0.6946338532175111, 'learning_rate': 1.4872180384383772e-05, 'epoch': 0.4} 40%|████ | 385/951 [07:03<09:33, 1.01s/it] 41%|████ | 386/951 [07:04<09:33, 1.01s/it] 41%|████ | 387/951 [07:05<09:31, 1.01s/it] 41%|████ | 388/951 [07:06<09:30, 1.01s/it] 41%|████ | 389/951 [07:07<09:29, 1.01s/it] 41%|████ | 390/951 [07:08<09:28, 1.01s/it] {'loss': 1.2774, 'grad_norm': 0.7312230396618198, 'learning_rate': 1.4710929109069674e-05, 'epoch': 0.41} 41%|████ | 390/951 [07:08<09:28, 1.01s/it] 41%|████ | 391/951 [07:09<09:28, 1.02s/it] 41%|████ | 392/951 [07:10<09:28, 1.02s/it] 41%|████▏ | 393/951 [07:11<09:26, 1.02s/it] 41%|████▏ | 394/951 [07:13<09:25, 1.01s/it] 42%|████▏ | 395/951 [07:14<09:23, 1.01s/it] {'loss': 1.2691, 'grad_norm': 0.685269793573933, 'learning_rate': 1.4548087816351616e-05, 'epoch': 0.42} 42%|████▏ | 395/951 [07:14<09:23, 1.01s/it] 42%|████▏ | 396/951 [07:15<09:23, 1.02s/it] 42%|████▏ | 397/951 [07:16<09:22, 
1.01s/it] 42%|████▏ | 398/951 [07:17<09:21, 1.01s/it] 42%|████▏ | 399/951 [07:18<09:20, 1.01s/it] 42%|████▏ | 400/951 [07:19<09:19, 1.01s/it] {'loss': 1.3029, 'grad_norm': 0.7367025121344211, 'learning_rate': 1.4383711467890776e-05, 'epoch': 0.42} 42%|████▏ | 400/951 [07:19<09:19, 1.01s/it][INFO|trainer.py:3503] 2024-10-10 13:20:14,440 >> Saving model checkpoint to /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-400 [INFO|configuration_utils.py:472] 2024-10-10 13:20:14,474 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-400/config.json [INFO|configuration_utils.py:807] 2024-10-10 13:20:14,477 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-400/generation_config.json [INFO|modeling_utils.py:2799] 2024-10-10 13:20:16,652 >> Model weights saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-400/model.safetensors [INFO|tokenization_utils_base.py:2684] 2024-10-10 13:20:16,658 >> tokenizer config file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-400/tokenizer_config.json [INFO|tokenization_utils_base.py:2693] 2024-10-10 13:20:16,660 >> Special tokens file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-400/special_tokens_map.json [2024-10-10 13:20:16,800] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step400 is about to be saved! [2024-10-10 13:20:16,807] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-400/global_step400/zero_pp_rank_0_mp_rank_00_model_states.pt [2024-10-10 13:20:16,807] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-400/global_step400/zero_pp_rank_0_mp_rank_00_model_states.pt... [2024-10-10 13:20:16,825] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-400/global_step400/zero_pp_rank_0_mp_rank_00_model_states.pt. [2024-10-10 13:20:16,835] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-400/global_step400/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... [2024-10-10 13:20:19,785] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-400/global_step400/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. [2024-10-10 13:20:19,793] [INFO] [engine.py:3536:_save_zero_checkpoint] zero checkpoint saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-400/global_step400/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [2024-10-10 13:20:20,166] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step400 is ready now! [INFO|trainer.py:3595] 2024-10-10 13:20:20,199 >> Deleting older checkpoint [/workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-300] due to args.save_total_limit /workspace/jihaozhe/miniconda3/envs/handbook/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. 
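The save pattern above (a new checkpoint every 100 steps, with the previous directory deleted right after each save "due to args.save_total_limit") corresponds to the Trainer's checkpoint-rotation options. A hedged sketch of how those arguments appear to be set for this run, inferred from the log behaviour rather than read from any config dump:

from transformers import TrainingArguments

# Inferred, not confirmed: checkpoints appear at steps 100, 200, 300, ... and the
# older directory is removed as soon as the next one is written.
args = TrainingArguments(
    output_dir="/workspace/jihaozhe/models/Llama-3.2-1B-sft-full",
    save_strategy="steps",
    save_steps=100,       # inferred from the checkpoint-100/200/300/... directories
    save_total_limit=1,   # inferred from the "Deleting older checkpoint ..." messages
)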
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined] 42%|████▏ | 401/951 [07:29<34:15, 3.74s/it] 42%|████▏ | 402/951 [07:30<26:43, 2.92s/it] 42%|████▏ | 403/951 [07:31<21:26, 2.35s/it] 42%|████▏ | 404/951 [07:32<17:44, 1.95s/it] 43%|████▎ | 405/951 [07:33<15:10, 1.67s/it] {'loss': 1.2846, 'grad_norm': 0.954670307639884, 'learning_rate': 1.4217855543455323e-05, 'epoch': 0.43} 43%|████▎ | 405/951 [07:33<15:10, 1.67s/it] 43%|████▎ | 406/951 [07:34<13:22, 1.47s/it] 43%|████▎ | 407/951 [07:35<12:05, 1.33s/it] 43%|████▎ | 408/951 [07:36<11:12, 1.24s/it] 43%|████▎ | 409/951 [07:37<10:35, 1.17s/it] 43%|████▎ | 410/951 [07:38<10:07, 1.12s/it] {'loss': 1.2686, 'grad_norm': 0.7884436143443709, 'learning_rate': 1.4050576022195084e-05, 'epoch': 0.43} 43%|████▎ | 410/951 [07:38<10:07, 1.12s/it] 43%|████▎ | 411/951 [07:39<09:49, 1.09s/it] 43%|████▎ | 412/951 [07:40<09:37, 1.07s/it] 43%|████▎ | 413/951 [07:41<09:31, 1.06s/it] 44%|████▎ | 414/951 [07:42<09:23, 1.05s/it] 44%|████▎ | 415/951 [07:43<09:16, 1.04s/it] {'loss': 1.2717, 'grad_norm': 0.6923648258189327, 'learning_rate': 1.3881929363747628e-05, 'epoch': 0.44} 44%|████▎ | 415/951 [07:43<09:16, 1.04s/it] 44%|████▎ | 416/951 [07:44<09:12, 1.03s/it] 44%|████▍ | 417/951 [07:45<09:07, 1.03s/it] 44%|████▍ | 418/951 [07:46<09:04, 1.02s/it] 44%|████▍ | 419/951 [07:47<09:38, 1.09s/it] 44%|████▍ | 420/951 [07:48<09:25, 1.07s/it] {'loss': 1.2968, 'grad_norm': 0.7131098366130528, 'learning_rate': 1.3711972489182208e-05, 'epoch': 0.44} 44%|████▍ | 420/951 [07:48<09:25, 1.07s/it] 44%|████▍ | 421/951 [07:49<09:17, 1.05s/it] 44%|████▍ | 422/951 [07:50<09:09, 1.04s/it] 44%|████▍ | 423/951 [07:51<09:04, 1.03s/it] 45%|████▍ | 424/951 [07:52<09:00, 1.02s/it] 45%|████▍ | 425/951 [07:53<08:57, 1.02s/it] {'loss': 1.2829, 'grad_norm': 0.7320310034272568, 'learning_rate': 1.3540762761787938e-05, 'epoch': 0.45} 45%|████▍ | 425/951 [07:53<08:57, 1.02s/it] 45%|████▍ | 426/951 [07:54<08:55, 1.02s/it] 45%|████▍ | 427/951 [07:55<08:53, 1.02s/it] 45%|████▌ | 428/951 [07:56<08:51, 1.02s/it] 45%|████▌ | 429/951 [07:57<08:50, 1.02s/it] 45%|████▌ | 430/951 [07:58<08:48, 1.01s/it] {'loss': 1.2877, 'grad_norm': 0.7327893476336677, 'learning_rate': 1.3368357967712726e-05, 'epoch': 0.45} 45%|████▌ | 430/951 [07:58<08:48, 1.01s/it] 45%|████▌ | 431/951 [07:59<08:48, 1.02s/it] 45%|████▌ | 432/951 [08:00<08:46, 1.02s/it] 46%|████▌ | 433/951 [08:01<08:46, 1.02s/it] 46%|████▌ | 434/951 [08:02<08:45, 1.02s/it] 46%|████▌ | 435/951 [08:03<08:43, 1.02s/it] {'loss': 1.2871, 'grad_norm': 0.6970014096675695, 'learning_rate': 1.3194816296459483e-05, 'epoch': 0.46} 46%|████▌ | 435/951 [08:03<08:43, 1.02s/it] 46%|████▌ | 436/951 [08:04<08:42, 1.02s/it] 46%|████▌ | 437/951 [08:05<08:41, 1.02s/it] 46%|████▌ | 438/951 [08:06<08:40, 1.02s/it] 46%|████▌ | 439/951 [08:07<08:39, 1.01s/it] 46%|████▋ | 440/951 [08:09<08:37, 1.01s/it] {'loss': 1.2897, 'grad_norm': 0.7014388522833548, 'learning_rate': 1.302019632124619e-05, 'epoch': 0.46} 46%|████▋ | 440/951 [08:09<08:37, 1.01s/it] 46%|████▋ | 441/951 [08:10<08:37, 1.01s/it] 46%|████▋ | 442/951 [08:11<08:36, 1.01s/it] 47%|████▋ | 443/951 [08:12<08:34, 1.01s/it] 47%|████▋ | 444/951 [08:13<08:34, 1.01s/it] 47%|████▋ | 445/951 [08:14<08:33, 1.01s/it] {'loss': 1.2714, 'grad_norm': 0.6970153763179125, 'learning_rate': 1.2844556979236463e-05, 'epoch': 0.47} 47%|████▋ | 445/951 [08:14<08:33, 1.01s/it] 47%|████▋ | 446/951 [08:15<08:32, 1.02s/it] 47%|████▋ | 447/951 [08:16<08:31, 1.01s/it] 47%|████▋ | 
448/951 [08:17<08:30, 1.01s/it] 47%|████▋ | 449/951 [08:18<08:29, 1.01s/it] 47%|████▋ | 450/951 [08:19<08:28, 1.01s/it] {'loss': 1.2705, 'grad_norm': 0.7162117080997548, 'learning_rate': 1.2667957551647263e-05, 'epoch': 0.47} 47%|████▋ | 450/951 [08:19<08:28, 1.01s/it] 47%|████▋ | 451/951 [08:20<08:27, 1.02s/it] 48%|████▊ | 452/951 [08:21<08:26, 1.01s/it] 48%|████▊ | 453/951 [08:22<08:24, 1.01s/it] 48%|████▊ | 454/951 [08:23<08:23, 1.01s/it] 48%|████▊ | 455/951 [08:24<08:22, 1.01s/it] {'loss': 1.2654, 'grad_norm': 0.7402201632812351, 'learning_rate': 1.24904576437405e-05, 'epoch': 0.48} 48%|████▊ | 455/951 [08:24<08:22, 1.01s/it] 48%|████▊ | 456/951 [08:25<08:22, 1.01s/it] 48%|████▊ | 457/951 [08:26<08:20, 1.01s/it] 48%|████▊ | 458/951 [08:27<08:20, 1.01s/it] 48%|████▊ | 459/951 [08:28<08:19, 1.01s/it] 48%|████▊ | 460/951 [08:29<08:18, 1.02s/it] {'loss': 1.2784, 'grad_norm': 1.0567871681807486, 'learning_rate': 1.2312117164705267e-05, 'epoch': 0.48} 48%|████▊ | 460/951 [08:29<08:18, 1.02s/it] 48%|████▊ | 461/951 [08:30<08:17, 1.02s/it] 49%|████▊ | 462/951 [08:31<08:16, 1.01s/it] 49%|████▊ | 463/951 [08:32<08:15, 1.01s/it] 49%|████▉ | 464/951 [08:33<08:13, 1.01s/it] 49%|████▉ | 465/951 [08:34<08:12, 1.01s/it] {'loss': 1.2574, 'grad_norm': 0.7383322032141622, 'learning_rate': 1.213299630743747e-05, 'epoch': 0.49} 49%|████▉ | 465/951 [08:34<08:12, 1.01s/it] 49%|████▉ | 466/951 [08:35<08:12, 1.01s/it] 49%|████▉ | 467/951 [08:36<08:10, 1.01s/it] 49%|████▉ | 468/951 [08:37<08:09, 1.01s/it] 49%|████▉ | 469/951 [08:38<08:08, 1.01s/it] 49%|████▉ | 470/951 [08:39<08:07, 1.01s/it] {'loss': 1.2861, 'grad_norm': 0.7402615331947396, 'learning_rate': 1.1953155528223728e-05, 'epoch': 0.49} 49%|████▉ | 470/951 [08:39<08:07, 1.01s/it] 50%|████▉ | 471/951 [08:40<08:07, 1.02s/it] 50%|████▉ | 472/951 [08:41<08:05, 1.01s/it] 50%|████▉ | 473/951 [08:42<08:04, 1.01s/it] 50%|████▉ | 474/951 [08:43<08:04, 1.01s/it] 50%|████▉ | 475/951 [08:44<08:02, 1.01s/it] {'loss': 1.2899, 'grad_norm': 0.7073342447220505, 'learning_rate': 1.1772655526336367e-05, 'epoch': 0.5} 50%|████▉ | 475/951 [08:44<08:02, 1.01s/it] 50%|█████ | 476/951 [08:45<08:02, 1.02s/it] 50%|█████ | 477/951 [08:46<08:02, 1.02s/it] 50%|█████ | 478/951 [08:47<08:06, 1.03s/it] 50%|█████ | 479/951 [08:48<08:03, 1.02s/it] 50%|█████ | 480/951 [08:49<08:21, 1.06s/it] {'loss': 1.2607, 'grad_norm': 0.6966270957094003, 'learning_rate': 1.1591557223546394e-05, 'epoch': 0.5} 50%|█████ | 480/951 [08:49<08:21, 1.06s/it] 51%|█████ | 481/951 [08:50<08:13, 1.05s/it] 51%|█████ | 482/951 [08:51<08:06, 1.04s/it] 51%|█████ | 483/951 [08:52<08:02, 1.03s/it] 51%|█████ | 484/951 [08:53<07:58, 1.02s/it] 51%|█████ | 485/951 [08:54<07:56, 1.02s/it] {'loss': 1.285, 'grad_norm': 0.6995026613750213, 'learning_rate': 1.1409921743561383e-05, 'epoch': 0.51} 51%|█████ | 485/951 [08:54<07:56, 1.02s/it] 51%|█████ | 486/951 [08:55<07:54, 1.02s/it] 51%|█████ | 487/951 [08:56<07:52, 1.02s/it] 51%|█████▏ | 488/951 [08:57<07:50, 1.02s/it] 51%|█████▏ | 489/951 [08:58<07:49, 1.02s/it] 52%|█████▏ | 490/951 [08:59<07:47, 1.01s/it] {'loss': 1.2787, 'grad_norm': 0.677915660036085, 'learning_rate': 1.1227810391395199e-05, 'epoch': 0.52} 52%|█████▏ | 490/951 [08:59<07:47, 1.01s/it] 52%|█████▏ | 491/951 [09:00<07:46, 1.01s/it] 52%|█████▏ | 492/951 [09:01<07:45, 1.01s/it] 52%|█████▏ | 493/951 [09:02<07:45, 1.02s/it] 52%|█████▏ | 494/951 [09:03<07:43, 1.02s/it] 52%|█████▏ | 495/951 [09:04<07:42, 1.01s/it] {'loss': 1.2817, 'grad_norm': 0.694058172582696, 'learning_rate': 1.1045284632676535e-05, 'epoch': 
0.52} 52%|█████▏ | 495/951 [09:04<07:42, 1.01s/it] 52%|█████▏ | 496/951 [09:06<07:42, 1.02s/it] 52%|█████▏ | 497/951 [09:07<07:40, 1.02s/it] 52%|█████▏ | 498/951 [09:08<07:39, 1.02s/it] 52%|█████▏ | 499/951 [09:09<07:38, 1.01s/it] 53%|█████▎ | 500/951 [09:10<07:36, 1.01s/it] {'loss': 1.269, 'grad_norm': 0.6977571213673606, 'learning_rate': 1.0862406072903224e-05, 'epoch': 0.53} 53%|█████▎ | 500/951 [09:10<07:36, 1.01s/it][INFO|trainer.py:3503] 2024-10-10 13:22:05,339 >> Saving model checkpoint to /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-500 [INFO|configuration_utils.py:472] 2024-10-10 13:22:05,350 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-500/config.json [INFO|configuration_utils.py:807] 2024-10-10 13:22:05,353 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-500/generation_config.json [INFO|modeling_utils.py:2799] 2024-10-10 13:22:08,199 >> Model weights saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-500/model.safetensors [INFO|tokenization_utils_base.py:2684] 2024-10-10 13:22:08,205 >> tokenizer config file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-500/tokenizer_config.json [INFO|tokenization_utils_base.py:2693] 2024-10-10 13:22:08,208 >> Special tokens file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-500/special_tokens_map.json [2024-10-10 13:22:08,339] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step500 is about to be saved! [2024-10-10 13:22:08,347] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt [2024-10-10 13:22:08,347] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt... [2024-10-10 13:22:08,362] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt. [2024-10-10 13:22:08,371] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-500/global_step500/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... [2024-10-10 13:22:11,387] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-500/global_step500/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. [2024-10-10 13:22:11,393] [INFO] [engine.py:3536:_save_zero_checkpoint] zero checkpoint saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-500/global_step500/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [2024-10-10 13:22:11,629] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step500 is ready now! [INFO|trainer.py:3595] 2024-10-10 13:22:11,650 >> Deleting older checkpoint [/workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-400] due to args.save_total_limit /workspace/jihaozhe/miniconda3/envs/handbook/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. 
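Alongside the consolidated model.safetensors, each checkpoint also carries the ZeRO-3 engine state under global_stepN/ (per-rank model and bf16 optimizer shards), which is what a resumed run would load. If full fp32 weights were ever needed directly from those shards, DeepSpeed provides a conversion helper; a minimal sketch, assuming DeepSpeed is installed and the directory layout shown in the log:

from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

ckpt_dir = "/workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-500"

# Reads the global_step500/* shard files from all ranks and stitches the
# partitioned parameters back into a single fp32 state dict on CPU.
state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir, tag="global_step500")
print(sum(p.numel() for p in state_dict.values()))  # ~1,235,814,400 trainable parameters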
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined] 53%|█████▎ | 501/951 [09:20<28:36, 3.81s/it] 53%|█████▎ | 502/951 [09:21<22:15, 2.97s/it] 53%|█████▎ | 503/951 [09:22<17:48, 2.39s/it] 53%|█████▎ | 504/951 [09:23<14:41, 1.97s/it] 53%|█████▎ | 505/951 [09:24<12:31, 1.69s/it] {'loss': 1.2569, 'grad_norm': 0.709669977534319, 'learning_rate': 1.067923643664936e-05, 'epoch': 0.53} 53%|█████▎ | 505/951 [09:24<12:31, 1.69s/it] 53%|█████▎ | 506/951 [09:25<11:00, 1.48s/it] 53%|█████▎ | 507/951 [09:26<09:56, 1.34s/it] 53%|█████▎ | 508/951 [09:27<09:10, 1.24s/it] 54%|█████▎ | 509/951 [09:28<08:38, 1.17s/it] 54%|█████▎ | 510/951 [09:29<08:16, 1.12s/it] {'loss': 1.2722, 'grad_norm': 0.700607260206098, 'learning_rate': 1.0495837546732224e-05, 'epoch': 0.54} 54%|█████▎ | 510/951 [09:29<08:16, 1.12s/it] 54%|█████▎ | 511/951 [09:30<08:01, 1.09s/it] 54%|█████▍ | 512/951 [09:31<07:49, 1.07s/it] 54%|█████▍ | 513/951 [09:32<07:40, 1.05s/it] 54%|█████▍ | 514/951 [09:33<07:34, 1.04s/it] 54%|█████▍ | 515/951 [09:34<07:30, 1.03s/it] {'loss': 1.2689, 'grad_norm': 0.6965062537462676, 'learning_rate': 1.031227130334604e-05, 'epoch': 0.54} 54%|█████▍ | 515/951 [09:34<07:30, 1.03s/it] 54%|█████▍ | 516/951 [09:35<07:26, 1.03s/it] 54%|█████▍ | 517/951 [09:36<07:23, 1.02s/it] 54%|█████▍ | 518/951 [09:37<07:21, 1.02s/it] 55%|█████▍ | 519/951 [09:38<07:19, 1.02s/it] 55%|█████▍ | 520/951 [09:39<07:18, 1.02s/it] {'loss': 1.3171, 'grad_norm': 0.6975551930599063, 'learning_rate': 1.0128599663169629e-05, 'epoch': 0.55} 55%|█████▍ | 520/951 [09:39<07:18, 1.02s/it] 55%|█████▍ | 521/951 [09:40<07:18, 1.02s/it] 55%|█████▍ | 522/951 [09:41<07:16, 1.02s/it] 55%|█████▍ | 523/951 [09:42<07:14, 1.02s/it] 55%|█████▌ | 524/951 [09:43<07:14, 1.02s/it] 55%|█████▌ | 525/951 [09:44<07:12, 1.02s/it] {'loss': 1.2616, 'grad_norm': 0.6835553883087109, 'learning_rate': 9.944884618454996e-06, 'epoch': 0.55} 55%|█████▌ | 525/951 [09:44<07:12, 1.02s/it] 55%|█████▌ | 526/951 [09:45<07:12, 1.02s/it] 55%|█████▌ | 527/951 [09:46<07:11, 1.02s/it] 56%|█████▌ | 528/951 [09:47<07:09, 1.02s/it] 56%|█████▌ | 529/951 [09:48<07:08, 1.02s/it] 56%|█████▌ | 530/951 [09:49<07:06, 1.01s/it] {'loss': 1.2842, 'grad_norm': 0.6932270302779978, 'learning_rate': 9.761188176103902e-06, 'epoch': 0.56} 56%|█████▌ | 530/951 [09:49<07:06, 1.01s/it] 56%|█████▌ | 531/951 [09:50<07:06, 1.01s/it] 56%|█████▌ | 532/951 [09:51<07:04, 1.01s/it] 56%|█████▌ | 533/951 [09:52<07:04, 1.01s/it] 56%|█████▌ | 534/951 [09:53<07:02, 1.01s/it] 56%|█████▋ | 535/951 [09:54<07:01, 1.01s/it] {'loss': 1.276, 'grad_norm': 0.7523295705383632, 'learning_rate': 9.577572336739491e-06, 'epoch': 0.56} 56%|█████▋ | 535/951 [09:54<07:01, 1.01s/it] 56%|█████▋ | 536/951 [09:55<07:01, 1.02s/it] 56%|█████▋ | 537/951 [09:56<07:00, 1.02s/it] 57%|█████▋ | 538/951 [09:57<06:58, 1.01s/it] 57%|█████▋ | 539/951 [09:58<06:58, 1.01s/it] 57%|█████▋ | 540/951 [10:00<07:15, 1.06s/it] {'loss': 1.2451, 'grad_norm': 0.7410406370549468, 'learning_rate': 9.394099073780066e-06, 'epoch': 0.57} 57%|█████▋ | 540/951 [10:00<07:15, 1.06s/it] 57%|█████▋ | 541/951 [10:01<07:08, 1.05s/it] 57%|█████▋ | 542/951 [10:02<07:03, 1.03s/it] 57%|█████▋ | 543/951 [10:03<07:30, 1.10s/it] 57%|█████▋ | 544/951 [10:04<07:17, 1.08s/it] 57%|█████▋ | 545/951 [10:05<07:09, 1.06s/it] {'loss': 1.2505, 'grad_norm': 0.6626519082062838, 'learning_rate': 9.210830312521991e-06, 'epoch': 0.57} 57%|█████▋ | 545/951 [10:05<07:09, 1.06s/it] 57%|█████▋ | 546/951 [10:06<07:03, 1.05s/it] 58%|█████▊ | 
547/951 [10:07<06:58, 1.04s/it] 58%|█████▊ | 548/951 [10:08<06:54, 1.03s/it] 58%|█████▊ | 549/951 [10:09<06:52, 1.03s/it] 58%|█████▊ | 550/951 [10:10<06:50, 1.02s/it] {'loss': 1.2884, 'grad_norm': 0.678178932637677, 'learning_rate': 9.027827909238902e-06, 'epoch': 0.58} 58%|█████▊ | 550/951 [10:10<06:50, 1.02s/it] 58%|█████▊ | 551/951 [10:11<06:48, 1.02s/it] 58%|█████▊ | 552/951 [10:12<06:46, 1.02s/it] 58%|█████▊ | 553/951 [10:13<06:45, 1.02s/it] 58%|█████▊ | 554/951 [10:14<06:43, 1.02s/it] 58%|█████▊ | 555/951 [10:15<06:42, 1.02s/it] {'loss': 1.2674, 'grad_norm': 0.6964235915302415, 'learning_rate': 8.84515363030414e-06, 'epoch': 0.58} 58%|█████▊ | 555/951 [10:15<06:42, 1.02s/it] 58%|█████▊ | 556/951 [10:16<06:42, 1.02s/it] 59%|█████▊ | 557/951 [10:17<06:40, 1.02s/it] 59%|█████▊ | 558/951 [10:18<06:38, 1.01s/it] 59%|█████▉ | 559/951 [10:19<06:37, 1.01s/it] 59%|█████▉ | 560/951 [10:20<06:36, 1.01s/it] {'loss': 1.2606, 'grad_norm': 0.7295327929832226, 'learning_rate': 8.662869131343607e-06, 'epoch': 0.59} 59%|█████▉ | 560/951 [10:20<06:36, 1.01s/it] 59%|█████▉ | 561/951 [10:21<06:35, 1.01s/it] 59%|█████▉ | 562/951 [10:22<06:34, 1.01s/it] 59%|█████▉ | 563/951 [10:23<06:33, 1.01s/it] 59%|█████▉ | 564/951 [10:24<06:32, 1.01s/it] 59%|█████▉ | 565/951 [10:25<06:31, 1.01s/it] {'loss': 1.2631, 'grad_norm': 0.6671644394075068, 'learning_rate': 8.481035936425928e-06, 'epoch': 0.59} 59%|█████▉ | 565/951 [10:25<06:31, 1.01s/it] 60%|█████▉ | 566/951 [10:26<06:30, 1.01s/it] 60%|█████▉ | 567/951 [10:27<06:29, 1.01s/it] 60%|█████▉ | 568/951 [10:28<06:28, 1.01s/it] 60%|█████▉ | 569/951 [10:29<06:27, 1.01s/it] 60%|█████▉ | 570/951 [10:30<06:26, 1.01s/it] {'loss': 1.2733, 'grad_norm': 0.6918129382316475, 'learning_rate': 8.299715417297072e-06, 'epoch': 0.6} 60%|█████▉ | 570/951 [10:30<06:26, 1.01s/it] 60%|██████ | 571/951 [10:31<06:25, 1.02s/it] 60%|██████ | 572/951 [10:32<06:24, 1.01s/it] 60%|██████ | 573/951 [10:33<06:23, 1.01s/it] 60%|██████ | 574/951 [10:34<06:22, 1.01s/it] 60%|██████ | 575/951 [10:35<06:21, 1.02s/it] {'loss': 1.2768, 'grad_norm': 0.7648575778497364, 'learning_rate': 8.118968772666338e-06, 'epoch': 0.6} 60%|██████ | 575/951 [10:35<06:21, 1.02s/it] 61%|██████ | 576/951 [10:36<06:21, 1.02s/it] 61%|██████ | 577/951 [10:37<06:20, 1.02s/it] 61%|██████ | 578/951 [10:38<06:18, 1.02s/it] 61%|██████ | 579/951 [10:39<06:17, 1.01s/it] 61%|██████ | 580/951 [10:40<06:16, 1.01s/it] {'loss': 1.2712, 'grad_norm': 0.6660976751002976, 'learning_rate': 7.938857007550797e-06, 'epoch': 0.61} 61%|██████ | 580/951 [10:40<06:16, 1.01s/it] 61%|██████ | 581/951 [10:41<06:15, 1.02s/it] 61%|██████ | 582/951 [10:42<06:14, 1.01s/it] 61%|██████▏ | 583/951 [10:43<06:13, 1.02s/it] 61%|██████▏ | 584/951 [10:44<06:12, 1.01s/it] 62%|██████▏ | 585/951 [10:46<06:11, 1.01s/it] {'loss': 1.2629, 'grad_norm': 0.8206761231157318, 'learning_rate': 7.759440912685043e-06, 'epoch': 0.62} 62%|██████▏ | 585/951 [10:46<06:11, 1.01s/it] 62%|██████▏ | 586/951 [10:47<06:10, 1.02s/it] 62%|██████▏ | 587/951 [10:48<06:09, 1.01s/it] 62%|██████▏ | 588/951 [10:49<06:08, 1.01s/it] 62%|██████▏ | 589/951 [10:50<06:07, 1.01s/it] 62%|██████▏ | 590/951 [10:51<06:05, 1.01s/it] {'loss': 1.2928, 'grad_norm': 0.6781893460495839, 'learning_rate': 7.580781044003324e-06, 'epoch': 0.62} 62%|██████▏ | 590/951 [10:51<06:05, 1.01s/it] 62%|██████▏ | 591/951 [10:52<06:05, 1.02s/it] 62%|██████▏ | 592/951 [10:53<06:04, 1.01s/it] 62%|██████▏ | 593/951 [10:54<06:03, 1.02s/it] 62%|██████▏ | 594/951 [10:55<06:02, 1.01s/it] 63%|██████▎ | 595/951 [10:56<06:01, 1.02s/it] 
{'loss': 1.2565, 'grad_norm': 0.7043738811990014, 'learning_rate': 7.402937702200905e-06, 'epoch': 0.63} 63%|██████▎ | 595/951 [10:56<06:01, 1.02s/it] 63%|██████▎ | 596/951 [10:57<06:00, 1.02s/it] 63%|██████▎ | 597/951 [10:58<05:59, 1.02s/it] 63%|██████▎ | 598/951 [10:59<05:57, 1.01s/it] 63%|██████▎ | 599/951 [11:00<05:56, 1.01s/it] 63%|██████▎ | 600/951 [11:01<05:56, 1.01s/it] {'loss': 1.2441, 'grad_norm': 0.6676378108729303, 'learning_rate': 7.225970912381557e-06, 'epoch': 0.63} 63%|██████▎ | 600/951 [11:01<05:56, 1.01s/it][INFO|trainer.py:3503] 2024-10-10 13:23:56,563 >> Saving model checkpoint to /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-600 [INFO|configuration_utils.py:472] 2024-10-10 13:23:56,580 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-600/config.json [INFO|configuration_utils.py:807] 2024-10-10 13:23:56,585 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-600/generation_config.json [INFO|modeling_utils.py:2799] 2024-10-10 13:23:59,978 >> Model weights saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-600/model.safetensors [INFO|tokenization_utils_base.py:2684] 2024-10-10 13:23:59,984 >> tokenizer config file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-600/tokenizer_config.json [INFO|tokenization_utils_base.py:2693] 2024-10-10 13:23:59,987 >> Special tokens file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-600/special_tokens_map.json [2024-10-10 13:24:00,110] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step600 is about to be saved! [2024-10-10 13:24:00,129] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-600/global_step600/zero_pp_rank_0_mp_rank_00_model_states.pt [2024-10-10 13:24:00,129] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-600/global_step600/zero_pp_rank_0_mp_rank_00_model_states.pt... [2024-10-10 13:24:00,140] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-600/global_step600/zero_pp_rank_0_mp_rank_00_model_states.pt. [2024-10-10 13:24:00,142] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-600/global_step600/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... [2024-10-10 13:24:03,377] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-600/global_step600/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. [2024-10-10 13:24:03,383] [INFO] [engine.py:3536:_save_zero_checkpoint] zero checkpoint saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-600/global_step600/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [2024-10-10 13:24:03,385] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step600 is ready now! [INFO|trainer.py:3595] 2024-10-10 13:24:03,402 >> Deleting older checkpoint [/workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-500] due to args.save_total_limit /workspace/jihaozhe/miniconda3/envs/handbook/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. 
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined] 63%|██████▎ | 601/951 [11:12<23:27, 4.02s/it] 63%|██████▎ | 602/951 [11:13<18:33, 3.19s/it] 63%|██████▎ | 603/951 [11:14<14:42, 2.54s/it] 64%|██████▎ | 604/951 [11:15<12:01, 2.08s/it] 64%|██████▎ | 605/951 [11:16<10:08, 1.76s/it] {'loss': 1.2526, 'grad_norm': 0.7086319165657909, 'learning_rate': 7.04994040379809e-06, 'epoch': 0.64} 64%|██████▎ | 605/951 [11:16<10:08, 1.76s/it] 64%|██████▎ | 606/951 [11:17<08:49, 1.54s/it] 64%|██████▍ | 607/951 [11:18<07:54, 1.38s/it] 64%|██████▍ | 608/951 [11:19<07:15, 1.27s/it] 64%|██████▍ | 609/951 [11:20<06:47, 1.19s/it] 64%|██████▍ | 610/951 [11:21<06:27, 1.14s/it] {'loss': 1.2689, 'grad_norm': 0.6584513029407596, 'learning_rate': 6.874905589692734e-06, 'epoch': 0.64} 64%|██████▍ | 610/951 [11:21<06:27, 1.14s/it] 64%|██████▍ | 611/951 [11:22<06:14, 1.10s/it] 64%|██████▍ | 612/951 [11:23<06:04, 1.07s/it] 64%|██████▍ | 613/951 [11:24<05:56, 1.06s/it] 65%|██████▍ | 614/951 [11:25<05:51, 1.04s/it] 65%|██████▍ | 615/951 [11:26<05:47, 1.03s/it] {'loss': 1.254, 'grad_norm': 0.6556498134167102, 'learning_rate': 6.700925547244173e-06, 'epoch': 0.65} 65%|██████▍ | 615/951 [11:26<05:47, 1.03s/it] 65%|██████▍ | 616/951 [11:27<05:44, 1.03s/it] 65%|██████▍ | 617/951 [11:28<05:42, 1.02s/it] 65%|██████▍ | 618/951 [11:29<05:40, 1.02s/it] 65%|██████▌ | 619/951 [11:30<05:38, 1.02s/it] 65%|██████▌ | 620/951 [11:31<05:36, 1.02s/it] {'loss': 1.2773, 'grad_norm': 0.7317497625816267, 'learning_rate': 6.528058997627995e-06, 'epoch': 0.65} 65%|██████▌ | 620/951 [11:31<05:36, 1.02s/it] 65%|██████▌ | 621/951 [11:32<05:35, 1.02s/it] 65%|██████▌ | 622/951 [11:33<05:34, 1.02s/it] 66%|██████▌ | 623/951 [11:34<05:32, 1.01s/it] 66%|██████▌ | 624/951 [11:35<05:31, 1.01s/it] 66%|██████▌ | 625/951 [11:36<05:30, 1.01s/it] {'loss': 1.2774, 'grad_norm': 0.6943495622984553, 'learning_rate': 6.356364286197341e-06, 'epoch': 0.66} 66%|██████▌ | 625/951 [11:36<05:30, 1.01s/it] 66%|██████▌ | 626/951 [11:37<05:29, 1.01s/it] 66%|██████▌ | 627/951 [11:38<05:28, 1.02s/it] 66%|██████▌ | 628/951 [11:39<05:27, 1.01s/it] 66%|██████▌ | 629/951 [11:40<05:26, 1.01s/it] 66%|██████▌ | 630/951 [11:41<05:24, 1.01s/it] {'loss': 1.2561, 'grad_norm': 0.6557703327216338, 'learning_rate': 6.18589936279034e-06, 'epoch': 0.66} 66%|██████▌ | 630/951 [11:41<05:24, 1.01s/it] 66%|██████▋ | 631/951 [11:42<05:24, 1.01s/it] 66%|██████▋ | 632/951 [11:43<05:23, 1.01s/it] 67%|██████▋ | 633/951 [11:44<05:22, 1.01s/it] 67%|██████▋ | 634/951 [11:45<05:21, 1.02s/it] 67%|██████▋ | 635/951 [11:46<05:20, 1.01s/it] {'loss': 1.2592, 'grad_norm': 0.7479106302240468, 'learning_rate': 6.016721762171098e-06, 'epoch': 0.67} 67%|██████▋ | 635/951 [11:46<05:20, 1.01s/it] 67%|██████▋ | 636/951 [11:47<05:19, 1.02s/it] 67%|██████▋ | 637/951 [11:49<05:18, 1.02s/it] 67%|██████▋ | 638/951 [11:50<05:17, 1.01s/it] 67%|██████▋ | 639/951 [11:51<05:16, 1.01s/it] 67%|██████▋ | 640/951 [11:52<05:15, 1.01s/it] {'loss': 1.2647, 'grad_norm': 0.6501670860792287, 'learning_rate': 5.848888584610727e-06, 'epoch': 0.67} 67%|██████▋ | 640/951 [11:52<05:15, 1.01s/it] 67%|██████▋ | 641/951 [11:53<05:14, 1.02s/it] 68%|██████▊ | 642/951 [11:54<05:13, 1.01s/it] 68%|██████▊ | 643/951 [11:55<05:12, 1.02s/it] 68%|██████▊ | 644/951 [11:56<05:11, 1.01s/it] 68%|██████▊ | 645/951 [11:57<05:10, 1.01s/it] {'loss': 1.2687, 'grad_norm': 0.6750656043866252, 'learning_rate': 5.6824564766150724e-06, 'epoch': 0.68} 68%|██████▊ | 645/951 [11:57<05:10, 1.01s/it] 
{'loss': 1.2658, 'grad_norm': 0.707162362118668, 'learning_rate': 5.51748161180554e-06, 'epoch': 0.68}  68%|██████▊ | 650/951 [12:02<05:05, 1.01s/it]
{'loss': 1.2618, 'grad_norm': 0.638182422688773, 'learning_rate': 5.354019671959601e-06, 'epoch': 0.69}  69%|██████▉ | 655/951 [12:07<04:59, 1.01s/it]
{'loss': 1.265, 'grad_norm': 0.6799258857804185, 'learning_rate': 5.192125828217203e-06, 'epoch': 0.69}  69%|██████▉ | 660/951 [12:12<04:54, 1.01s/it]
{'loss': 1.2726, 'grad_norm': 0.6614260729055945, 'learning_rate': 5.0318547224596525e-06, 'epoch': 0.7}  70%|██████▉ | 665/951 [12:17<05:06, 1.07s/it]
{'loss': 1.2635, 'grad_norm': 0.6571261971890797, 'learning_rate': 4.873260448867004e-06, 'epoch': 0.7}  70%|███████ | 670/951 [12:22<04:47, 1.02s/it]
{'loss': 1.2571, 'grad_norm': 0.6434154288185444, 'learning_rate': 4.716396535660412e-06, 'epoch': 0.71}  71%|███████ | 675/951 [12:27<04:40, 1.01s/it]
{'loss': 1.2768, 'grad_norm': 0.6507843588939136, 'learning_rate': 4.5613159270354455e-06, 'epoch': 0.72}  72%|███████▏ | 680/951 [12:33<04:34, 1.01s/it]
{'loss': 1.2729, 'grad_norm': 0.6888038425645421, 'learning_rate': 4.408070965292534e-06, 'epoch': 0.72}  72%|███████▏ | 685/951 [12:38<04:29, 1.01s/it]
{'loss': 1.2754, 'grad_norm': 0.6548677910438372, 'learning_rate': 4.256713373170565e-06, 'epoch': 0.73}  73%|███████▎ | 690/951 [12:43<04:24, 1.01s/it]
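The checkpoint rhythm visible in this log (a save at step 600, deletion of checkpoint-500 right after, and the same pattern every 100 steps below) is the Trainer's step-based saving combined with a small save_total_limit. A hedged sketch of arguments that would reproduce this pattern; the values are inferred from the log, not copied from the actual launch configuration:

from transformers import TrainingArguments

# Inferred from the log: checkpoints at steps 500, 600, 700, ... with the
# previous directory deleted after each save. The real recipe may differ.
args = TrainingArguments(
    output_dir="/workspace/jihaozhe/models/Llama-3.2-1B-sft-full",
    save_strategy="steps",
    save_steps=100,        # checkpoint-500, -600, -700, ...
    save_total_limit=1,    # keep only the newest checkpoint, delete the rest
    logging_steps=5,       # matches the loss lines printed every 5 steps
)
print(args.save_steps, args.save_total_limit)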
{'loss': 1.2428, 'grad_norm': 0.6924489060189015, 'learning_rate': 4.107294236389603e-06, 'epoch': 0.73}  73%|███████▎ | 695/951 [12:48<04:19, 1.01s/it]
{'loss': 1.2422, 'grad_norm': 0.675766971656018, 'learning_rate': 3.959863986408593e-06, 'epoch': 0.74}  74%|███████▎ | 700/951 [12:53<04:14, 1.02s/it]
[INFO|trainer.py:3503] 2024-10-10 13:25:48,657 >> Saving model checkpoint to /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-700
[INFO|configuration_utils.py:472] 2024-10-10 13:25:48,693 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-700/config.json
[INFO|configuration_utils.py:807] 2024-10-10 13:25:48,696 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-700/generation_config.json
[INFO|modeling_utils.py:2799] 2024-10-10 13:25:51,575 >> Model weights saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-700/model.safetensors
[INFO|tokenization_utils_base.py:2684] 2024-10-10 13:25:51,581 >> tokenizer config file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-700/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2024-10-10 13:25:51,584 >> Special tokens file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-700/special_tokens_map.json
[2024-10-10 13:25:51,709] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step700 is about to be saved!
[2024-10-10 13:25:51,717] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-700/global_step700/zero_pp_rank_0_mp_rank_00_model_states.pt
[2024-10-10 13:25:51,717] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-700/global_step700/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2024-10-10 13:25:51,735] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-700/global_step700/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2024-10-10 13:25:51,746] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-700/global_step700/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-10-10 13:25:54,846] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-700/global_step700/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-10-10 13:25:54,851] [INFO] [engine.py:3536:_save_zero_checkpoint] zero checkpoint saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-700/global_step700/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-10-10 13:25:54,853] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step700 is ready now!
[INFO|trainer.py:3595] 2024-10-10 13:25:54,871 >> Deleting older checkpoint [/workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-600] due to args.save_total_limit
/workspace/jihaozhe/miniconda3/envs/handbook/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
{'loss': 1.2596, 'grad_norm': 0.6759780214552317, 'learning_rate': 3.8144723834039076e-06, 'epoch': 0.74}  74%|███████▍ | 705/951 [13:07<06:56, 1.69s/it]
{'loss': 1.2653, 'grad_norm': 0.6577848370025432, 'learning_rate': 3.671168499474449e-06, 'epoch': 0.75}  75%|███████▍ | 710/951 [13:12<04:31, 1.13s/it]
{'loss': 1.2612, 'grad_norm': 0.6816662338694881, 'learning_rate': 3.5300007020789997e-06, 'epoch': 0.75}  75%|███████▌ | 715/951 [13:17<04:03, 1.03s/it]
{'loss': 1.2606, 'grad_norm': 0.6412644189429695, 'learning_rate': 3.3910166377113894e-06, 'epoch': 0.76}  76%|███████▌ | 720/951 [13:23<03:54, 1.02s/it]
{'loss': 1.2499, 'grad_norm': 0.6489342949048617, 'learning_rate': 3.2542632158190135e-06, 'epoch': 0.76}  76%|███████▌ | 725/951 [13:28<03:56, 1.05s/it]
{'loss': 1.2368, 'grad_norm': 0.6305858451193405, 'learning_rate': 3.119786592970102e-06, 'epoch': 0.77}  77%|███████▋ | 730/951 [13:33<03:45, 1.02s/it]
{'loss': 1.2826, 'grad_norm': 0.6601632070842128, 'learning_rate': 2.9876321572751143e-06, 'epoch': 0.77}  77%|███████▋ | 735/951 [13:38<03:39, 1.01s/it]
{'loss': 1.2509, 'grad_norm': 0.6512679140886882, 'learning_rate': 2.8578445130674835e-06, 'epoch': 0.78}  78%|███████▊ | 740/951 [13:43<03:33, 1.01s/it]
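The learning_rate values logged above shrink smoothly toward zero as the run approaches step 951, which is the shape of a cosine decay schedule (optionally preceded by a linear warmup). A small sketch of that rule, the one implemented by transformers.get_cosine_schedule_with_warmup; the base learning rate and warmup length below are placeholders, not this run's values:

import math

def cosine_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Placeholder values, for the shape of the curve only.
for step in (100, 500, 900):
    print(step, cosine_lr(step, base_lr=1e-5, warmup_steps=50, total_steps=951))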
{'loss': 1.2593, 'grad_norm': 0.662446849135675, 'learning_rate': 2.7304674658489104e-06, 'epoch': 0.78}  78%|███████▊ | 745/951 [13:48<03:29, 1.01s/it]
{'loss': 1.2696, 'grad_norm': 0.6503725043711072, 'learning_rate': 2.6055440075042793e-06, 'epoch': 0.79}  79%|███████▉ | 750/951 [13:53<03:23, 1.01s/it]
{'loss': 1.2458, 'grad_norm': 0.6349030168998883, 'learning_rate': 2.4831163017911687e-06, 'epoch': 0.79}  79%|███████▉ | 755/951 [13:58<03:18, 1.01s/it]
{'loss': 1.2581, 'grad_norm': 0.6296499888770365, 'learning_rate': 2.3632256701088817e-06, 'epoch': 0.8}  80%|███████▉ | 760/951 [14:03<03:13, 1.01s/it]
{'loss': 1.2614, 'grad_norm': 0.6477912073667947, 'learning_rate': 2.2459125775517854e-06, 'epoch': 0.8}  80%|████████ | 765/951 [14:08<03:08, 1.01s/it]
{'loss': 1.2707, 'grad_norm': 0.6340147585135972, 'learning_rate': 2.1312166192516593e-06, 'epoch': 0.81}  81%|████████ | 770/951 [14:13<03:03, 1.01s/it]
{'loss': 1.2586, 'grad_norm': 0.6535117628009195, 'learning_rate': 2.019176507013677e-06, 'epoch': 0.81}  81%|████████▏ | 775/951 [14:19<02:58, 1.01s/it]
{'loss': 1.2548, 'grad_norm': 0.6467399510691909, 'learning_rate': 1.9098300562505266e-06, 'epoch': 0.82}  82%|████████▏ | 780/951 [14:24<02:53, 1.01s/it]
{'loss': 1.26, 'grad_norm': 0.6237075846859721, 'learning_rate': 1.8032141732190722e-06, 'epoch': 0.83}  83%|████████▎ | 785/951 [14:29<03:00, 1.09s/it]
{'loss': 1.2605, 'grad_norm': 0.6374095389974708, 'learning_rate': 1.6993648425638797e-06, 'epoch': 0.83}  83%|████████▎ | 790/951 [14:34<02:48, 1.05s/it]
{'loss': 1.2701, 'grad_norm': 0.6326077447290254, 'learning_rate': 1.5983171151717924e-06, 'epoch': 0.84}  84%|████████▎ | 795/951 [14:39<02:38, 1.02s/it]
{'loss': 1.2427, 'grad_norm': 0.6299244980807773, 'learning_rate': 1.5001050963416718e-06, 'epoch': 0.84}  84%|████████▍ | 800/951 [14:44<02:33, 1.01s/it]
[INFO|trainer.py:3503] 2024-10-10 13:27:40,109 >> Saving model checkpoint to /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-800
[INFO|configuration_utils.py:472] 2024-10-10 13:27:40,120 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-800/config.json
[INFO|configuration_utils.py:807] 2024-10-10 13:27:40,123 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-800/generation_config.json
[INFO|modeling_utils.py:2799] 2024-10-10 13:27:43,577 >> Model weights saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-800/model.safetensors
[INFO|tokenization_utils_base.py:2684] 2024-10-10 13:27:43,583 >> tokenizer config file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-800/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2024-10-10 13:27:43,586 >> Special tokens file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-800/special_tokens_map.json
[2024-10-10 13:27:43,708] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step800 is about to be saved!
[2024-10-10 13:27:43,716] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-800/global_step800/zero_pp_rank_0_mp_rank_00_model_states.pt
[2024-10-10 13:27:43,716] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-800/global_step800/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2024-10-10 13:27:43,739] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-800/global_step800/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2024-10-10 13:27:43,747] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-800/global_step800/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-10-10 13:27:46,605] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-800/global_step800/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-10-10 13:27:46,611] [INFO] [engine.py:3536:_save_zero_checkpoint] zero checkpoint saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-800/global_step800/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-10-10 13:27:47,025] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step800 is ready now!
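Each checkpoint-N directory written above holds both the consolidated model.safetensors produced by the Trainer and a global_stepN folder with the DeepSpeed ZeRO shards (model states plus bf16 optimizer states). If only the shards were available, DeepSpeed's documented zero_to_fp32 helper can merge them back into a single state dict; a hedged sketch, noting that the helper's exact module path and signature can vary across DeepSpeed versions:

# Merge the ZeRO shards under checkpoint-800/global_step800 into one fp32
# state dict. Not required here (model.safetensors already exists), but handy
# when only the global_step* shards survive.
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

ckpt_dir = "/workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-800"
state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir)
print(len(state_dict), "tensors,", sum(t.numel() for t in state_dict.values()), "parameters")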
[INFO|trainer.py:3595] 2024-10-10 13:27:47,044 >> Deleting older checkpoint [/workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-700] due to args.save_total_limit
/workspace/jihaozhe/miniconda3/envs/handbook/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
{'loss': 1.2635, 'grad_norm': 0.6281934013360307, 'learning_rate': 1.404761934273291e-06, 'epoch': 0.85}  85%|████████▍ | 805/951 [14:59<04:09, 1.71s/it]
{'loss': 1.2717, 'grad_norm': 0.6374294536071755, 'learning_rate': 1.3123198088792577e-06, 'epoch': 0.85}  85%|████████▌ | 810/951 [15:04<02:39, 1.13s/it]
{'loss': 1.2499, 'grad_norm': 0.6298699771941692, 'learning_rate': 1.222809920923761e-06, 'epoch': 0.86}  86%|████████▌ | 815/951 [15:09<02:20, 1.03s/it]
{'loss': 1.2552, 'grad_norm': 0.6359419744147026, 'learning_rate': 1.1362624814917843e-06, 'epoch': 0.86}  86%|████████▌ | 820/951 [15:14<02:13, 1.02s/it]
{'loss': 1.2737, 'grad_norm': 0.6285813274381151, 'learning_rate': 1.0527067017923654e-06, 'epoch': 0.87}  87%|████████▋ | 825/951 [15:19<02:07, 1.01s/it]
{'loss': 1.2663, 'grad_norm': 0.6210713562517961, 'learning_rate': 9.721707832993232e-07, 'epoch': 0.87}  87%|████████▋ | 830/951 [15:24<02:02, 1.01s/it]
{'loss': 1.2581, 'grad_norm': 0.6645416418213763, 'learning_rate': 8.946819082327829e-07, 'epoch': 0.88}  88%|████████▊ | 835/951 [15:29<01:57, 1.01s/it]
{'loss': 1.2519, 'grad_norm': 0.6251751866944497, 'learning_rate': 8.202662303847298e-07, 'epoch': 0.88}  88%|████████▊ | 840/951 [15:35<01:52, 1.01s/it]
{'loss': 1.2614, 'grad_norm': 0.6431790207111582, 'learning_rate': 7.48948866291661e-07, 'epoch': 0.89}  89%|████████▉ | 845/951 [15:40<01:48, 1.02s/it]
{'loss': 1.2722, 'grad_norm': 0.6384407054635708, 'learning_rate': 6.80753886757336e-07, 'epoch': 0.89}  89%|████████▉ | 850/951 [15:45<01:43, 1.03s/it]
{'loss': 1.2587, 'grad_norm': 0.631430226459749, 'learning_rate': 6.157043087284797e-07, 'epoch': 0.9}  90%|████████▉ | 855/951 [15:50<01:37, 1.02s/it]
{'loss': 1.2481, 'grad_norm': 0.631843611779411, 'learning_rate': 5.538220875261736e-07, 'epoch': 0.9}  90%|█████████ | 860/951 [15:55<01:32, 1.01s/it]
{'loss': 1.2552, 'grad_norm': 0.6250494334708813, 'learning_rate': 4.951281094355708e-07, 'epoch': 0.91}  91%|█████████ | 865/951 [16:00<01:27, 1.01s/it]
{'loss': 1.2536, 'grad_norm': 0.6300532606909163, 'learning_rate': 4.396421846564236e-07, 'epoch': 0.91}  91%|█████████▏| 870/951 [16:05<01:22, 1.01s/it]
{'loss': 1.2694, 'grad_norm': 0.6200012582073006, 'learning_rate': 3.8738304061681107e-07, 'epoch': 0.92}  92%|█████████▏| 875/951 [16:10<01:16, 1.01s/it]
{'loss': 1.2545, 'grad_norm': 0.6324403821172889, 'learning_rate': 3.3836831565231877e-07, 'epoch': 0.93}  93%|█████████▎| 880/951 [16:15<01:11, 1.01s/it]
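The grad_norm field in these records is the global L2 norm of all parameter gradients, the value gradient clipping reports before any rescaling (and, with sharded optimizer states, after reduction across ranks). A self-contained sketch of that computation on a toy model, purely illustrative:

import torch

model = torch.nn.Linear(10, 1)            # toy model, for illustration only
x, y = torch.randn(4, 10), torch.randn(4, 1)

torch.nn.functional.mse_loss(model(x), y).backward()

# Global L2 norm over all gradients: the root of the sum of squared
# per-parameter norms. clip_grad_norm_ returns this same total norm and
# rescales gradients only if it exceeds max_norm.
per_param = [torch.linalg.vector_norm(p.grad.detach()) for p in model.parameters()]
total_norm = torch.linalg.vector_norm(torch.stack(per_param))
reported = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(float(total_norm), float(reported))  # the two values match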
{'loss': 1.2531, 'grad_norm': 0.6325686488791483, 'learning_rate': 2.926145530528002e-07, 'epoch': 0.93}  93%|█████████▎| 885/951 [16:20<01:06, 1.01s/it]
{'loss': 1.2648, 'grad_norm': 0.6373731914514896, 'learning_rate': 2.501371954787479e-07, 'epoch': 0.94}  94%|█████████▎| 890/951 [16:25<01:01, 1.01s/it]
{'loss': 1.2392, 'grad_norm': 0.6255593953506046, 'learning_rate': 2.109505797491318e-07, 'epoch': 0.94}  94%|█████████▍| 895/951 [16:31<00:56, 1.01s/it]
{'loss': 1.2633, 'grad_norm': 0.6353907042408413, 'learning_rate': 1.7506793200248507e-07, 'epoch': 0.95}  95%|█████████▍| 900/951 [16:36<00:51, 1.01s/it]
[INFO|trainer.py:3503] 2024-10-10 13:29:31,349 >> Saving model checkpoint to /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-900
[INFO|configuration_utils.py:472] 2024-10-10 13:29:31,363 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-900/config.json
[INFO|configuration_utils.py:807] 2024-10-10 13:29:31,366 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-900/generation_config.json
[INFO|modeling_utils.py:2799] 2024-10-10 13:29:33,547 >> Model weights saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-900/model.safetensors
[INFO|tokenization_utils_base.py:2684] 2024-10-10 13:29:33,553 >> tokenizer config file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-900/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2024-10-10 13:29:33,556 >> Special tokens file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-900/special_tokens_map.json
[2024-10-10 13:29:33,679] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step900 is about to be saved!
[2024-10-10 13:29:33,707] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-900/global_step900/zero_pp_rank_0_mp_rank_00_model_states.pt
[2024-10-10 13:29:33,707] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-900/global_step900/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2024-10-10 13:29:33,718] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-900/global_step900/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2024-10-10 13:29:33,721] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-900/global_step900/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-10-10 13:29:36,792] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-900/global_step900/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-10-10 13:29:36,798] [INFO] [engine.py:3536:_save_zero_checkpoint] zero checkpoint saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-900/global_step900/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-10-10 13:29:36,807] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step900 is ready now!
[INFO|trainer.py:3595] 2024-10-10 13:29:36,854 >> Deleting older checkpoint [/workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-800] due to args.save_total_limit
/workspace/jihaozhe/miniconda3/envs/handbook/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
{'loss': 1.2603, 'grad_norm': 0.6307207707434458, 'learning_rate': 1.4250136323285868e-07, 'epoch': 0.95}  95%|█████████▌| 905/951 [16:49<01:15, 1.65s/it]
{'loss': 1.2453, 'grad_norm': 0.6277705175750871, 'learning_rate': 1.1326186520215888e-07, 'epoch': 0.96}  96%|█████████▌| 910/951 [16:55<00:50, 1.23s/it]
{'loss': 1.2544, 'grad_norm': 0.6031775390358434, 'learning_rate': 8.735930673024806e-08, 'epoch': 0.96}  96%|█████████▌| 915/951 [17:00<00:37, 1.05s/it]
{'loss': 1.2661, 'grad_norm': 0.6189714576376613, 'learning_rate': 6.480243036404598e-08, 'epoch': 0.97}  97%|█████████▋| 920/951 [17:05<00:31, 1.02s/it]
{'loss': 1.2418, 'grad_norm': 0.6219922382476483, 'learning_rate': 4.5598849426777833e-08, 'epoch': 0.97}  97%|█████████▋| 925/951 [17:10<00:26, 1.02s/it]
{'loss': 1.2471, 'grad_norm': 0.6188912441406357, 'learning_rate': 2.9755045448351948e-08, 'epoch': 0.98}  98%|█████████▊| 930/951 [17:15<00:21, 1.02s/it]
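Besides the weights, every checkpoint directory written above also contains the Trainer's bookkeeping file trainer_state.json, whose log_history field holds the same loss/grad_norm/learning_rate records that are streamed to the console here. A small sketch of reading it back from one of this run's checkpoint paths (assuming the directory still exists):

import json

path = "/workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-900/trainer_state.json"
with open(path) as f:
    state = json.load(f)

print(state["global_step"])        # 900 for this checkpoint
print(state["log_history"][-1])    # most recently logged metrics dict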
{'loss': 1.2591, 'grad_norm': 0.6241442557644146, 'learning_rate': 1.7276365977730858e-08, 'epoch': 0.98}  98%|█████████▊| 935/951 [17:20<00:16, 1.01s/it]
{'loss': 1.2783, 'grad_norm': 0.6148942917091785, 'learning_rate': 8.167022778045042e-09, 'epoch': 0.99}  99%|█████████▉| 940/951 [17:25<00:11, 1.02s/it]
{'loss': 1.2311, 'grad_norm': 0.6142845224019251, 'learning_rate': 2.430090405054486e-09, 'epoch': 0.99}  99%|█████████▉| 945/951 [17:31<00:06, 1.02s/it]
{'loss': 1.2488, 'grad_norm': 0.6318778333420784, 'learning_rate': 6.750516943321295e-11, 'epoch': 1.0}  100%|█████████▉| 950/951 [17:36<00:01, 1.02s/it]
100%|██████████| 951/951 [17:37<00:00, 1.01s/it]
[INFO|trainer.py:3503] 2024-10-10 13:30:32,410 >> Saving model checkpoint to /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-951
[INFO|configuration_utils.py:472] 2024-10-10 13:30:32,420 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-951/config.json
[INFO|configuration_utils.py:807] 2024-10-10 13:30:32,423 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-951/generation_config.json
[INFO|modeling_utils.py:2799] 2024-10-10 13:30:34,599 >> Model weights saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-951/model.safetensors
[INFO|tokenization_utils_base.py:2684] 2024-10-10 13:30:34,604 >> tokenizer config file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-951/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2024-10-10 13:30:34,607 >> Special tokens file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-951/special_tokens_map.json
[2024-10-10 13:30:34,734] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step951 is about to be saved!
[2024-10-10 13:30:34,755] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-951/global_step951/zero_pp_rank_0_mp_rank_00_model_states.pt
[2024-10-10 13:30:34,755] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-951/global_step951/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2024-10-10 13:30:34,764] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-951/global_step951/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2024-10-10 13:30:34,775] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-951/global_step951/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-10-10 13:30:37,782] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-951/global_step951/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-10-10 13:30:37,789] [INFO] [engine.py:3536:_save_zero_checkpoint] zero checkpoint saved /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-951/global_step951/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-10-10 13:30:38,115] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step951 is ready now!
[INFO|trainer.py:3595] 2024-10-10 13:30:38,135 >> Deleting older checkpoint [/workspace/jihaozhe/models/Llama-3.2-1B-sft-full/checkpoint-900] due to args.save_total_limit
[INFO|trainer.py:3819] 2024-10-10 13:30:40,431 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-10-10 13:30:40,431 >> Num examples = 13458
[INFO|trainer.py:3824] 2024-10-10 13:30:40,431 >> Batch size = 8
0%| | 0/211 [00:00
> Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 1098.6048, 'train_samples_per_second': 110.718, 'train_steps_per_second': 0.866, 'train_loss': 1.289618753484371, 'epoch': 1.0}
100%|██████████| 951/951 [18:18<00:00, 1.01s/it]
100%|██████████| 951/951 [18:18<00:00, 1.16s/it]
***** train metrics *****
  epoch                    =        1.0
  total_flos               =    94149GF
  train_loss               =     1.2896
  train_runtime            = 0:18:18.60
  train_samples            =     207865
  train_samples_per_second =    110.718
  train_steps_per_second   =      0.866
2024-10-10 13:31:13 - INFO - __main__ - *** Save model ***
[INFO|trainer.py:3503] 2024-10-10 13:31:13,934 >> Saving model checkpoint to /workspace/jihaozhe/models/Llama-3.2-1B-sft-full
[INFO|configuration_utils.py:472] 2024-10-10 13:31:13,940 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/config.json
[INFO|configuration_utils.py:807] 2024-10-10 13:31:13,943 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/generation_config.json
[INFO|modeling_utils.py:2799] 2024-10-10 13:31:16,123 >> Model weights saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/model.safetensors
[INFO|tokenization_utils_base.py:2684] 2024-10-10 13:31:16,129 >> tokenizer config file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2024-10-10 13:31:16,132 >> Special tokens file saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/special_tokens_map.json
2024-10-10 13:31:16 - INFO - __main__ - Model saved to /workspace/jihaozhe/models/Llama-3.2-1B-sft-full
[INFO|modelcard.py:449] 2024-10-10 13:31:16,253 >> Dropping the following result as it does not have all the necessary fields: {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}, 'dataset': {'name': '/workspace/jihaozhe/data/ultrachat_200k', 'type': '/workspace/jihaozhe/data/ultrachat_200k', 'config': 'default', 'split': 'train', 'args': 'default'}}
[INFO|configuration_utils.py:472] 2024-10-10 13:31:16,259 >> Configuration saved in /workspace/jihaozhe/models/Llama-3.2-1B-sft-full/config.json
2024-10-10 13:31:16 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:3819] 2024-10-10 13:31:16,262 >> ***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-10-10 13:31:16,262 >> Num examples = 13458
[INFO|trainer.py:3824] 2024-10-10 13:31:16,262 >> Batch size = 8
[WARNING|logging.py:328] 2024-10-10 13:31:16,291 >> We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
0%| | 0/211 [00:00
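After the final "*** Save model ***" step, /workspace/jihaozhe/models/Llama-3.2-1B-sft-full contains the config, generation config, model.safetensors and tokenizer files, so it loads like any Hub checkpoint. A minimal usage sketch; the prompt and generation settings are illustrative only, and in real use the tokenizer's chat template should be applied to match how the model was fine-tuned:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "/workspace/jihaozhe/models/Llama-3.2-1B-sft-full"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Explain what supervised fine-tuning does.", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))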