07/16/2024 09:47:54 - INFO - llamafactory.hparams.parser - Process rank: 4, device: cuda:4, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|parser.py:325] 2024-07-16 09:47:54,375 >> Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/16/2024 09:47:54 - INFO - llamafactory.hparams.parser - Process rank: 5, device: cuda:5, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/16/2024 09:47:54 - INFO - llamafactory.hparams.parser - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/16/2024 09:47:54 - INFO - llamafactory.hparams.parser - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/16/2024 09:47:54 - INFO - llamafactory.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/16/2024 09:47:54 - INFO - llamafactory.hparams.parser - Process rank: 6, device: cuda:6, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/16/2024 09:47:54 - INFO - llamafactory.hparams.parser - Process rank: 7, device: cuda:7, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2161] 2024-07-16 09:47:54,671 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/tokenizer.json
[INFO|tokenization_utils_base.py:2161] 2024-07-16 09:47:54,671 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2161] 2024-07-16 09:47:54,671 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/special_tokens_map.json
[INFO|tokenization_utils_base.py:2161] 2024-07-16 09:47:54,672 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/tokenizer_config.json
[WARNING|logging.py:313] 2024-07-16 09:47:54,959 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|template.py:270] 2024-07-16 09:47:54,960 >> Replace eos token: <|eot_id|>
[INFO|template.py:372] 2024-07-16 09:47:54,960 >> Add pad token: <|eot_id|>
[INFO|loader.py:50] 2024-07-16 09:47:54,960 >> Loading dataset 0716_truthfulqa_benchmark_train.json...
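The "Replace eos token" / "Add pad token" messages above amount to pointing both the EOS and PAD slots of the tokenizer at <|eot_id|> (id 128009), since Llama-3-Instruct ends turns with <|eot_id|> and ships without a pad token. A minimal sketch using the stock transformers API, not LLaMA-Factory's template code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Mirror the template step: use the end-of-turn token as EOS and reuse it as PAD.
tokenizer.eos_token = "<|eot_id|>"
tokenizer.pad_token = tokenizer.eos_token

print(tokenizer.eos_token_id, tokenizer.pad_token_id)  # both resolve to 128009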
[INFO|configuration_utils.py:733] 2024-07-16 09:48:00,277 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/config.json
[INFO|configuration_utils.py:800] 2024-07-16 09:48:00,280 >> Model config LlamaConfig {
  "_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.42.3",
  "use_cache": true,
  "vocab_size": 128256
}

[INFO|modeling_utils.py:3556] 2024-07-16 09:48:00,330 >> loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/model.safetensors.index.json
[INFO|modeling_utils.py:1531] 2024-07-16 09:48:00,332 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1000] 2024-07-16 09:48:00,334 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128009
}

[INFO|modeling_utils.py:4364] 2024-07-16 09:48:04,157 >> All model checkpoint weights were used when initializing LlamaForCausalLM.
[INFO|modeling_utils.py:4372] 2024-07-16 09:48:04,157 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at meta-llama/Meta-Llama-3-8B-Instruct. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:955] 2024-07-16 09:48:04,331 >> loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/generation_config.json
[INFO|configuration_utils.py:1000] 2024-07-16 09:48:04,332 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128009
  ],
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}

[INFO|checkpointing.py:103] 2024-07-16 09:48:04,339 >> Gradient checkpointing enabled.
[INFO|attention.py:80] 2024-07-16 09:48:04,339 >> Using torch SDPA for faster training and inference.
[INFO|adapter.py:302] 2024-07-16 09:48:04,339 >> Upcasting trainable params to float32.
[INFO|adapter.py:48] 2024-07-16 09:48:04,339 >> Fine-tuning method: Full
[INFO|loader.py:196] 2024-07-16 09:48:04,384 >> trainable params: 8,030,261,248 || all params: 8,030,261,248 || trainable%: 100.0000
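The model is loaded in bfloat16 with torch SDPA attention, gradient checkpointing, and the trainable parameters upcast to float32 for full fine-tuning. A rough plain-transformers sketch of the same setup (assumed equivalent for illustration, not LLaMA-Factory's actual loader code):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,        # "Instantiating ... under default dtype torch.bfloat16"
    attn_implementation="sdpa",        # "Using torch SDPA for faster training and inference"
)
model.gradient_checkpointing_enable()  # "Gradient checkpointing enabled"

# "Upcasting trainable params to float32": with full fine-tuning every weight is
# trainable, so the master weights end up in fp32 while the half-precision
# backend ("auto half precision backend" below) handles the bf16 compute.
for param in model.parameters():
    if param.requires_grad:
        param.data = param.data.float()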
[INFO|trainer.py:642] 2024-07-16 09:48:04,390 >> Using auto half precision backend
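The parameter count reported by the loader can be reproduced from the LlamaConfig printed above; with full fine-tuning every parameter is trainable, so the trainable and total counts coincide. Plain arithmetic, no framework code:

vocab, hidden, inter, layers = 128256, 4096, 14336, 32
kv_heads, head_dim = 8, 4096 // 32             # 8 KV heads of dimension 128 (GQA)

embeddings = vocab * hidden                    # input embeddings
lm_head = vocab * hidden                       # untied output head (tie_word_embeddings: false)
attention = 2 * hidden * hidden + 2 * hidden * kv_heads * head_dim   # q/o plus k/v projections
mlp = 3 * hidden * inter                       # gate, up, down projections
norms = 2 * hidden                             # two RMSNorms per layer
per_layer = attention + mlp + norms

total = embeddings + lm_head + layers * per_layer + hidden   # plus the final RMSNorm
print(f"{total:,}")                            # 8,030,261,248 -> matches trainable and all params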
[INFO|trainer.py:2128] 2024-07-16 09:48:27,728 >> ***** Running training *****
[INFO|trainer.py:2129] 2024-07-16 09:48:27,728 >> Num examples = 4,968
[INFO|trainer.py:2130] 2024-07-16 09:48:27,728 >> Num Epochs = 5
[INFO|trainer.py:2131] 2024-07-16 09:48:27,729 >> Instantaneous batch size per device = 2
[INFO|trainer.py:2134] 2024-07-16 09:48:27,729 >> Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:2135] 2024-07-16 09:48:27,729 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2136] 2024-07-16 09:48:27,729 >> Total optimization steps = 190
[INFO|trainer.py:2137] 2024-07-16 09:48:27,730 >> Number of trainable parameters = 8,030,261,248
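The banner's numbers are internally consistent: 2 samples per device, 8 GPUs, and 8 accumulation steps give the effective batch of 128, and 4,968 examples over 5 epochs give the 190 optimization steps. A quick check in plain arithmetic:

per_device_batch = 2        # "Instantaneous batch size per device"
num_gpus = 8                # ranks 0-7 in the parser log
grad_accum = 8              # "Gradient Accumulation steps"
num_examples = 4_968
num_epochs = 5

effective_batch = per_device_batch * num_gpus * grad_accum   # 128
steps_per_epoch = num_examples // effective_batch            # 38; the trailing partial batch adds no step in this run
total_steps = steps_per_epoch * num_epochs                   # 190
print(effective_batch, total_steps)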
[INFO|callbacks.py:310] 2024-07-16 09:48:41,429 >> {'loss': 14.1364, 'learning_rate': 5.0000e-07, 'epoch': 0.03, 'throughput': 475.41}
[INFO|callbacks.py:310] 2024-07-16 09:48:54,626 >> {'loss': 13.7804, 'learning_rate': 1.0000e-06, 'epoch': 0.05, 'throughput': 477.72}
[INFO|callbacks.py:310] 2024-07-16 09:49:07,808 >> {'loss': 13.4871, 'learning_rate': 1.5000e-06, 'epoch': 0.08, 'throughput': 480.28}
[INFO|callbacks.py:310] 2024-07-16 09:49:20,995 >> {'loss': 12.7900, 'learning_rate': 2.0000e-06, 'epoch': 0.10, 'throughput': 475.82}
[INFO|callbacks.py:310] 2024-07-16 09:49:34,157 >> {'loss': 9.2748, 'learning_rate': 2.5000e-06, 'epoch': 0.13, 'throughput': 481.26}
[INFO|callbacks.py:310] 2024-07-16 09:49:47,321 >> {'loss': 6.5585, 'learning_rate': 3.0000e-06, 'epoch': 0.15, 'throughput': 479.46}
[INFO|callbacks.py:310] 2024-07-16 09:50:00,485 >> {'loss': 5.3984, 'learning_rate': 3.5000e-06, 'epoch': 0.18, 'throughput': 475.92}
[INFO|callbacks.py:310] 2024-07-16 09:50:13,669 >> {'loss': 1.9363, 'learning_rate': 4.0000e-06, 'epoch': 0.21, 'throughput': 476.05}
[INFO|callbacks.py:310] 2024-07-16 09:50:26,844 >> {'loss': 0.6783, 'learning_rate': 4.5000e-06, 'epoch': 0.23, 'throughput': 477.93}
[INFO|callbacks.py:310] 2024-07-16 09:50:40,003 >> {'loss': 2.9945, 'learning_rate': 5.0000e-06, 'epoch': 0.26, 'throughput': 478.89}
[INFO|callbacks.py:310] 2024-07-16 09:50:53,164 >> {'loss': 0.2916, 'learning_rate': 4.9996e-06, 'epoch': 0.28, 'throughput': 478.35}
[INFO|callbacks.py:310] 2024-07-16 09:51:06,343 >> {'loss': 2.2775, 'learning_rate': 4.9985e-06, 'epoch': 0.31, 'throughput': 478.05}
[INFO|callbacks.py:310] 2024-07-16 09:51:19,511 >> {'loss': 0.3757, 'learning_rate': 4.9966e-06, 'epoch': 0.33, 'throughput': 478.29}
[INFO|callbacks.py:310] 2024-07-16 09:51:32,674 >> {'loss': 1.9543, 'learning_rate': 4.9939e-06, 'epoch': 0.36, 'throughput': 479.20}
[INFO|callbacks.py:310] 2024-07-16 09:51:45,855 >> {'loss': 0.7398, 'learning_rate': 4.9905e-06, 'epoch': 0.39, 'throughput': 478.49}
[INFO|callbacks.py:310] 2024-07-16 09:51:59,041 >> {'loss': 1.1868, 'learning_rate': 4.9863e-06, 'epoch': 0.41, 'throughput': 479.67}
[INFO|callbacks.py:310] 2024-07-16 09:52:12,210 >> {'loss': 0.5418, 'learning_rate': 4.9814e-06, 'epoch': 0.44, 'throughput': 478.83}
[INFO|callbacks.py:310] 2024-07-16 09:52:25,377 >> {'loss': 0.2263, 'learning_rate': 4.9757e-06, 'epoch': 0.46, 'throughput': 479.37}
[INFO|callbacks.py:310] 2024-07-16 09:52:38,537 >> {'loss': 0.1612, 'learning_rate': 4.9692e-06, 'epoch': 0.49, 'throughput': 479.86}
[INFO|callbacks.py:310] 2024-07-16 09:52:51,713 >> {'loss': 0.3299, 'learning_rate': 4.9620e-06, 'epoch': 0.51, 'throughput': 480.64}
[INFO|callbacks.py:310] 2024-07-16 09:53:04,885 >> {'loss': 0.2013, 'learning_rate': 4.9541e-06, 'epoch': 0.54, 'throughput': 481.12}
[INFO|callbacks.py:310] 2024-07-16 09:53:18,042 >> {'loss': 0.2446, 'learning_rate': 4.9454e-06, 'epoch': 0.57, 'throughput': 481.36}
[INFO|callbacks.py:310] 2024-07-16 09:53:31,215 >> {'loss': 0.2235, 'learning_rate': 4.9359e-06, 'epoch': 0.59, 'throughput': 481.55}
[INFO|callbacks.py:310] 2024-07-16 09:53:44,385 >> {'loss': 0.1160, 'learning_rate': 4.9257e-06, 'epoch': 0.62, 'throughput': 480.78}
[INFO|callbacks.py:310] 2024-07-16 09:53:57,544 >> {'loss': 0.2179, 'learning_rate': 4.9148e-06, 'epoch': 0.64, 'throughput': 480.61}
[INFO|callbacks.py:310] 2024-07-16 09:54:10,712 >> {'loss': 0.1414, 'learning_rate': 4.9032e-06, 'epoch': 0.67, 'throughput': 480.45}
[INFO|callbacks.py:310] 2024-07-16 09:54:23,865 >> {'loss': 0.1181, 'learning_rate': 4.8908e-06, 'epoch': 0.69, 'throughput': 481.03}
[INFO|callbacks.py:310] 2024-07-16 09:54:37,030 >> {'loss': 0.2753, 'learning_rate': 4.8776e-06, 'epoch': 0.72, 'throughput': 481.61}
[INFO|callbacks.py:310] 2024-07-16 09:54:50,171 >> {'loss': 0.3255, 'learning_rate': 4.8638e-06, 'epoch': 0.75, 'throughput': 482.08}
[INFO|callbacks.py:310] 2024-07-16 09:55:03,354 >> {'loss': 0.2352, 'learning_rate': 4.8492e-06, 'epoch': 0.77, 'throughput': 482.60}
[INFO|callbacks.py:310] 2024-07-16 09:55:16,515 >> {'loss': 0.0630, 'learning_rate': 4.8340e-06, 'epoch': 0.80, 'throughput': 482.48}
[INFO|callbacks.py:310] 2024-07-16 09:55:29,668 >> {'loss': 0.2042, 'learning_rate': 4.8180e-06, 'epoch': 0.82, 'throughput': 482.50}
[INFO|callbacks.py:310] 2024-07-16 09:55:42,841 >> {'loss': 0.1364, 'learning_rate': 4.8013e-06, 'epoch': 0.85, 'throughput': 482.93}
[INFO|callbacks.py:310] 2024-07-16 09:55:56,012 >> {'loss': 0.0934, 'learning_rate': 4.7839e-06, 'epoch': 0.87, 'throughput': 482.98}
[INFO|callbacks.py:310] 2024-07-16 09:56:09,170 >> {'loss': 0.1332, 'learning_rate': 4.7658e-06, 'epoch': 0.90, 'throughput': 483.11}
[INFO|callbacks.py:310] 2024-07-16 09:56:22,332 >> {'loss': 0.1595, 'learning_rate': 4.7470e-06, 'epoch': 0.93, 'throughput': 483.00}
[INFO|callbacks.py:310] 2024-07-16 09:56:35,503 >> {'loss': 0.1528, 'learning_rate': 4.7275e-06, 'epoch': 0.95, 'throughput': 483.18}
[INFO|callbacks.py:310] 2024-07-16 09:56:48,669 >> {'loss': 0.1342, 'learning_rate': 4.7074e-06, 'epoch': 0.98, 'throughput': 483.48}
[INFO|callbacks.py:310] 2024-07-16 09:57:01,819 >> {'loss': 0.1586, 'learning_rate': 4.6865e-06, 'epoch': 1.00, 'throughput': 483.71}
[INFO|callbacks.py:310] 2024-07-16 09:57:14,986 >> {'loss': 0.1072, 'learning_rate': 4.6651e-06, 'epoch': 1.03, 'throughput': 483.77}
[INFO|callbacks.py:310] 2024-07-16 09:57:28,147 >> {'loss': 0.0357, 'learning_rate': 4.6429e-06, 'epoch': 1.05, 'throughput': 484.04}
[INFO|callbacks.py:310] 2024-07-16 09:57:41,316 >> {'loss': 0.0600, 'learning_rate': 4.6201e-06, 'epoch': 1.08, 'throughput': 484.18}
[INFO|callbacks.py:310] 2024-07-16 09:57:54,470 >> {'loss': 0.0902, 'learning_rate': 4.5967e-06, 'epoch': 1.11, 'throughput': 484.46}
[INFO|callbacks.py:310] 2024-07-16 09:58:07,621 >> {'loss': 0.0202, 'learning_rate': 4.5726e-06, 'epoch': 1.13, 'throughput': 484.51}
[INFO|callbacks.py:310] 2024-07-16 09:58:20,803 >> {'loss': 0.0380, 'learning_rate': 4.5479e-06, 'epoch': 1.16, 'throughput': 484.10}
[INFO|callbacks.py:310] 2024-07-16 09:58:33,969 >> {'loss': 0.0379, 'learning_rate': 4.5225e-06, 'epoch': 1.18, 'throughput': 484.17}
[INFO|callbacks.py:310] 2024-07-16 09:58:47,129 >> {'loss': 0.0742, 'learning_rate': 4.4966e-06, 'epoch': 1.21, 'throughput': 484.24}
[INFO|callbacks.py:310] 2024-07-16 09:59:00,303 >> {'loss': 0.0658, 'learning_rate': 4.4700e-06, 'epoch': 1.23, 'throughput': 483.64}
[INFO|callbacks.py:310] 2024-07-16 09:59:13,461 >> {'loss': 0.0336, 'learning_rate': 4.4429e-06, 'epoch': 1.26, 'throughput': 483.99}
[INFO|callbacks.py:310] 2024-07-16 09:59:26,622 >> {'loss': 0.1021, 'learning_rate': 4.4151e-06, 'epoch': 1.29, 'throughput': 483.77}
[INFO|callbacks.py:310] 2024-07-16 09:59:39,766 >> {'loss': 0.1312, 'learning_rate': 4.3868e-06, 'epoch': 1.31, 'throughput': 483.74}
[INFO|callbacks.py:310] 2024-07-16 09:59:52,949 >> {'loss': 0.0665, 'learning_rate': 4.3579e-06, 'epoch': 1.34, 'throughput': 483.68}
[INFO|callbacks.py:310] 2024-07-16 10:00:06,104 >> {'loss': 0.0679, 'learning_rate': 4.3284e-06, 'epoch': 1.36, 'throughput': 483.66}
[INFO|callbacks.py:310] 2024-07-16 10:00:19,266 >> {'loss': 0.0579, 'learning_rate': 4.2983e-06, 'epoch': 1.39, 'throughput': 483.46}
[INFO|callbacks.py:310] 2024-07-16 10:00:32,433 >> {'loss': 0.0542, 'learning_rate': 4.2678e-06, 'epoch': 1.41, 'throughput': 483.69}
[INFO|callbacks.py:310] 2024-07-16 10:00:45,598 >> {'loss': 0.0476, 'learning_rate': 4.2366e-06, 'epoch': 1.44, 'throughput': 483.69}
[INFO|callbacks.py:310] 2024-07-16 10:00:58,749 >> {'loss': 0.0613, 'learning_rate': 4.2050e-06, 'epoch': 1.47, 'throughput': 483.84}
[INFO|callbacks.py:310] 2024-07-16 10:01:11,904 >> {'loss': 0.0995, 'learning_rate': 4.1728e-06, 'epoch': 1.49, 'throughput': 483.76}
[INFO|callbacks.py:310] 2024-07-16 10:01:25,086 >> {'loss': 0.0532, 'learning_rate': 4.1401e-06, 'epoch': 1.52, 'throughput': 483.57}
[INFO|callbacks.py:310] 2024-07-16 10:01:38,265 >> {'loss': 0.0824, 'learning_rate': 4.1070e-06, 'epoch': 1.54, 'throughput': 483.60}
[INFO|callbacks.py:310] 2024-07-16 10:01:51,421 >> {'loss': 0.0499, 'learning_rate': 4.0733e-06, 'epoch': 1.57, 'throughput': 483.63}
[INFO|callbacks.py:310] 2024-07-16 10:02:04,575 >> {'loss': 0.0413, 'learning_rate': 4.0392e-06, 'epoch': 1.59, 'throughput': 483.75}
[INFO|callbacks.py:310] 2024-07-16 10:02:17,738 >> {'loss': 0.0637, 'learning_rate': 4.0045e-06, 'epoch': 1.62, 'throughput': 484.01}
[INFO|callbacks.py:310] 2024-07-16 10:02:30,912 >> {'loss': 0.0529, 'learning_rate': 3.9695e-06, 'epoch': 1.65, 'throughput': 483.77}
[INFO|callbacks.py:310] 2024-07-16 10:02:44,068 >> {'loss': 0.0474, 'learning_rate': 3.9339e-06, 'epoch': 1.67, 'throughput': 483.73}
[INFO|callbacks.py:310] 2024-07-16 10:02:57,237 >> {'loss': 0.0649, 'learning_rate': 3.8980e-06, 'epoch': 1.70, 'throughput': 483.55}
[INFO|callbacks.py:310] 2024-07-16 10:03:10,409 >> {'loss': 0.0505, 'learning_rate': 3.8616e-06, 'epoch': 1.72, 'throughput': 483.51}
[INFO|callbacks.py:310] 2024-07-16 10:03:23,580 >> {'loss': 0.0621, 'learning_rate': 3.8248e-06, 'epoch': 1.75, 'throughput': 483.14}
[INFO|callbacks.py:310] 2024-07-16 10:03:36,735 >> {'loss': 0.0769, 'learning_rate': 3.7876e-06, 'epoch': 1.77, 'throughput': 483.20}
[INFO|callbacks.py:310] 2024-07-16 10:03:49,897 >> {'loss': 0.0435, 'learning_rate': 3.7500e-06, 'epoch': 1.80, 'throughput': 483.42}
[INFO|callbacks.py:310] 2024-07-16 10:04:03,040 >> {'loss': 0.0673, 'learning_rate': 3.7120e-06, 'epoch': 1.83, 'throughput': 483.69}
[INFO|callbacks.py:310] 2024-07-16 10:04:16,202 >> {'loss': 0.1316, 'learning_rate': 3.6737e-06, 'epoch': 1.85, 'throughput': 483.44}
[INFO|callbacks.py:310] 2024-07-16 10:04:29,356 >> {'loss': 0.0531, 'learning_rate': 3.6350e-06, 'epoch': 1.88, 'throughput': 483.53}
[INFO|callbacks.py:310] 2024-07-16 10:04:42,540 >> {'loss': 0.0287, 'learning_rate': 3.5959e-06, 'epoch': 1.90, 'throughput': 483.62}
[INFO|callbacks.py:310] 2024-07-16 10:04:55,704 >> {'loss': 0.0648, 'learning_rate': 3.5565e-06, 'epoch': 1.93, 'throughput': 483.59}
[INFO|callbacks.py:310] 2024-07-16 10:05:08,874 >> {'loss': 0.1211, 'learning_rate': 3.5168e-06, 'epoch': 1.95, 'throughput': 483.54}
[INFO|callbacks.py:310] 2024-07-16 10:05:22,046 >> {'loss': 0.0879, 'learning_rate': 3.4768e-06, 'epoch': 1.98, 'throughput': 483.26}
[INFO|callbacks.py:310] 2024-07-16 10:05:35,205 >> {'loss': 0.0227, 'learning_rate': 3.4365e-06, 'epoch': 2.01, 'throughput': 483.39}
[INFO|callbacks.py:310] 2024-07-16 10:05:48,359 >> {'loss': 0.0228, 'learning_rate': 3.3959e-06, 'epoch': 2.03, 'throughput': 483.45}
[INFO|callbacks.py:310] 2024-07-16 10:06:01,518 >> {'loss': 0.0360, 'learning_rate': 3.3551e-06, 'epoch': 2.06, 'throughput': 483.47}
[INFO|callbacks.py:310] 2024-07-16 10:06:14,696 >> {'loss': 0.0138, 'learning_rate': 3.3139e-06, 'epoch': 2.08, 'throughput': 483.36}
[INFO|callbacks.py:310] 2024-07-16 10:06:27,870 >> {'loss': 0.0697, 'learning_rate': 3.2725e-06, 'epoch': 2.11, 'throughput': 483.18}
[INFO|callbacks.py:310] 2024-07-16 10:06:41,041 >> {'loss': 0.0508, 'learning_rate': 3.2309e-06, 'epoch': 2.14, 'throughput': 482.89}
[INFO|callbacks.py:310] 2024-07-16 10:06:54,208 >> {'loss': 0.0088, 'learning_rate': 3.1891e-06, 'epoch': 2.16, 'throughput': 483.18}
[INFO|callbacks.py:310] 2024-07-16 10:07:07,375 >> {'loss': 0.0158, 'learning_rate': 3.1470e-06, 'epoch': 2.19, 'throughput': 483.34}
[INFO|callbacks.py:310] 2024-07-16 10:07:20,542 >> {'loss': 0.0060, 'learning_rate': 3.1048e-06, 'epoch': 2.21, 'throughput': 483.30}
[INFO|callbacks.py:310] 2024-07-16 10:07:33,693 >> {'loss': 0.0380, 'learning_rate': 3.0624e-06, 'epoch': 2.24, 'throughput': 483.67}
[INFO|callbacks.py:310] 2024-07-16 10:07:46,864 >> {'loss': 0.0004, 'learning_rate': 3.0198e-06, 'epoch': 2.26, 'throughput': 483.58}
[INFO|callbacks.py:310] 2024-07-16 10:08:00,047 >> {'loss': 0.0111, 'learning_rate': 2.9770e-06, 'epoch': 2.29, 'throughput': 483.47}
[INFO|callbacks.py:310] 2024-07-16 10:08:13,201 >> {'loss': 0.0008, 'learning_rate': 2.9341e-06, 'epoch': 2.32, 'throughput': 483.64}
[INFO|callbacks.py:310] 2024-07-16 10:08:26,357 >> {'loss': 0.0182, 'learning_rate': 2.8911e-06, 'epoch': 2.34, 'throughput': 483.70}
[INFO|callbacks.py:310] 2024-07-16 10:08:39,526 >> {'loss': 0.0491, 'learning_rate': 2.8479e-06, 'epoch': 2.37, 'throughput': 483.66}
[INFO|callbacks.py:310] 2024-07-16 10:08:52,691 >> {'loss': 0.0040, 'learning_rate': 2.8047e-06, 'epoch': 2.39, 'throughput': 483.71}
[INFO|callbacks.py:310] 2024-07-16 10:09:05,854 >> {'loss': 0.0176, 'learning_rate': 2.7613e-06, 'epoch': 2.42, 'throughput': 483.76}
[INFO|callbacks.py:310] 2024-07-16 10:09:19,001 >> {'loss': 0.0190, 'learning_rate': 2.7179e-06, 'epoch': 2.44, 'throughput': 483.69}
[INFO|callbacks.py:310] 2024-07-16 10:09:32,181 >> {'loss': 0.0270, 'learning_rate': 2.6744e-06, 'epoch': 2.47, 'throughput': 483.49}
[INFO|callbacks.py:310] 2024-07-16 10:09:45,346 >> {'loss': 0.0354, 'learning_rate': 2.6308e-06, 'epoch': 2.50, 'throughput': 483.49}
[INFO|callbacks.py:310] 2024-07-16 10:09:58,504 >> {'loss': 0.0741, 'learning_rate': 2.5872e-06, 'epoch': 2.52, 'throughput': 483.59}
[INFO|callbacks.py:310] 2024-07-16 10:10:11,684 >> {'loss': 0.0582, 'learning_rate': 2.5436e-06, 'epoch': 2.55, 'throughput': 483.53}
[INFO|callbacks.py:310] 2024-07-16 10:10:24,850 >> {'loss': 0.0096, 'learning_rate': 2.5000e-06, 'epoch': 2.57, 'throughput': 483.66}
[INFO|callbacks.py:310] 2024-07-16 10:10:38,015 >> {'loss': 0.0263, 'learning_rate': 2.4564e-06, 'epoch': 2.60, 'throughput': 483.71}
[INFO|callbacks.py:310] 2024-07-16 10:10:51,176 >> {'loss': 0.0121, 'learning_rate': 2.4128e-06, 'epoch': 2.62, 'throughput': 483.65}
[INFO|callbacks.py:310] 2024-07-16 10:11:04,355 >> {'loss': 0.0204, 'learning_rate': 2.3692e-06, 'epoch': 2.65, 'throughput': 483.62}
[INFO|callbacks.py:310] 2024-07-16 10:11:17,518 >> {'loss': 0.0325, 'learning_rate': 2.3256e-06, 'epoch': 2.68, 'throughput': 483.74}
[INFO|callbacks.py:310] 2024-07-16 10:11:30,679 >> {'loss': 0.0076, 'learning_rate': 2.2821e-06, 'epoch': 2.70, 'throughput': 483.58}
[INFO|callbacks.py:310] 2024-07-16 10:11:43,845 >> {'loss': 0.0485, 'learning_rate': 2.2387e-06, 'epoch': 2.73, 'throughput': 483.48}
[INFO|callbacks.py:310] 2024-07-16 10:11:57,010 >> {'loss': 0.0070, 'learning_rate': 2.1953e-06, 'epoch': 2.75, 'throughput': 483.31}
[INFO|callbacks.py:310] 2024-07-16 10:12:10,178 >> {'loss': 0.0347, 'learning_rate': 2.1521e-06, 'epoch': 2.78, 'throughput': 483.23}
[INFO|callbacks.py:310] 2024-07-16 10:12:23,333 >> {'loss': 0.0142, 'learning_rate': 2.1089e-06, 'epoch': 2.80, 'throughput': 483.41}
[INFO|callbacks.py:310] 2024-07-16 10:12:36,503 >> {'loss': 0.0414, 'learning_rate': 2.0659e-06, 'epoch': 2.83, 'throughput': 483.41}
[INFO|callbacks.py:310] 2024-07-16 10:12:49,670 >> {'loss': 0.0419, 'learning_rate': 2.0230e-06, 'epoch': 2.86, 'throughput': 483.45}
[INFO|callbacks.py:310] 2024-07-16 10:13:02,837 >> {'loss': 0.0430, 'learning_rate': 1.9802e-06, 'epoch': 2.88, 'throughput': 483.52}
[INFO|callbacks.py:310] 2024-07-16 10:13:15,995 >> {'loss': 0.0192, 'learning_rate': 1.9376e-06, 'epoch': 2.91, 'throughput': 483.49}
[INFO|callbacks.py:310] 2024-07-16 10:13:29,163 >> {'loss': 0.0427, 'learning_rate': 1.8952e-06, 'epoch': 2.93, 'throughput': 483.53}
[INFO|callbacks.py:310] 2024-07-16 10:13:42,332 >> {'loss': 0.0116, 'learning_rate': 1.8530e-06, 'epoch': 2.96, 'throughput': 483.44}
[INFO|callbacks.py:310] 2024-07-16 10:13:55,503 >> {'loss': 0.0135, 'learning_rate': 1.8109e-06, 'epoch': 2.98, 'throughput': 483.38}
[INFO|callbacks.py:310] 2024-07-16 10:14:08,655 >> {'loss': 0.0128, 'learning_rate': 1.7691e-06, 'epoch': 3.01, 'throughput': 483.40}
[INFO|callbacks.py:310] 2024-07-16 10:14:21,830 >> {'loss': 0.0021, 'learning_rate': 1.7275e-06, 'epoch': 3.04, 'throughput': 483.50}
[INFO|callbacks.py:310] 2024-07-16 10:14:35,006 >> {'loss': 0.0057, 'learning_rate': 1.6861e-06, 'epoch': 3.06, 'throughput': 483.41}
[INFO|callbacks.py:310] 2024-07-16 10:14:48,169 >> {'loss': 0.0197, 'learning_rate': 1.6449e-06, 'epoch': 3.09, 'throughput': 483.37}
[INFO|callbacks.py:310] 2024-07-16 10:15:01,334 >> {'loss': 0.0017, 'learning_rate': 1.6041e-06, 'epoch': 3.11, 'throughput': 483.22}
[INFO|callbacks.py:310] 2024-07-16 10:15:14,501 >> {'loss': 0.0068, 'learning_rate': 1.5635e-06, 'epoch': 3.14, 'throughput': 483.07}
[INFO|callbacks.py:310] 2024-07-16 10:15:27,662 >> {'loss': 0.0022, 'learning_rate': 1.5232e-06, 'epoch': 3.16, 'throughput': 483.02}
[INFO|callbacks.py:310] 2024-07-16 10:15:40,803 >> {'loss': 0.0162, 'learning_rate': 1.4832e-06, 'epoch': 3.19, 'throughput': 483.18}
[INFO|callbacks.py:310] 2024-07-16 10:15:53,978 >> {'loss': 0.0014, 'learning_rate': 1.4435e-06, 'epoch': 3.22, 'throughput': 483.24}
[INFO|callbacks.py:310] 2024-07-16 10:16:07,150 >> {'loss': 0.0063, 'learning_rate': 1.4041e-06, 'epoch': 3.24, 'throughput': 483.23}
[INFO|callbacks.py:310] 2024-07-16 10:16:20,313 >> {'loss': 0.0282, 'learning_rate': 1.3650e-06, 'epoch': 3.27, 'throughput': 483.34}
[INFO|callbacks.py:310] 2024-07-16 10:16:33,471 >> {'loss': 0.0003, 'learning_rate': 1.3263e-06, 'epoch': 3.29, 'throughput': 483.41}
[INFO|callbacks.py:310] 2024-07-16 10:16:46,637 >> {'loss': 0.0002, 'learning_rate': 1.2880e-06, 'epoch': 3.32, 'throughput': 483.37}
[INFO|callbacks.py:310] 2024-07-16 10:16:59,801 >> {'loss': 0.0004, 'learning_rate': 1.2500e-06, 'epoch': 3.34, 'throughput': 483.38}
[INFO|callbacks.py:310] 2024-07-16 10:17:12,952 >> {'loss': 0.0169, 'learning_rate': 1.2124e-06, 'epoch': 3.37, 'throughput': 483.44}
[INFO|callbacks.py:310] 2024-07-16 10:17:26,129 >> {'loss': 0.0127, 'learning_rate': 1.1752e-06, 'epoch': 3.40, 'throughput': 483.34}
[INFO|callbacks.py:310] 2024-07-16 10:17:39,308 >> {'loss': 0.0045, 'learning_rate': 1.1384e-06, 'epoch': 3.42, 'throughput': 483.25}
[INFO|callbacks.py:310] 2024-07-16 10:17:52,479 >> {'loss': 0.0924, 'learning_rate': 1.1020e-06, 'epoch': 3.45, 'throughput': 483.31}
[INFO|callbacks.py:310] 2024-07-16 10:18:05,645 >> {'loss': 0.0067, 'learning_rate': 1.0661e-06, 'epoch': 3.47, 'throughput': 483.33}
[INFO|callbacks.py:310] 2024-07-16 10:18:18,814 >> {'loss': 0.0030, 'learning_rate': 1.0305e-06, 'epoch': 3.50, 'throughput': 483.19}
[INFO|callbacks.py:310] 2024-07-16 10:18:31,962 >> {'loss': 0.0164, 'learning_rate': 9.9546e-07, 'epoch': 3.52, 'throughput': 483.29}
[INFO|callbacks.py:310] 2024-07-16 10:18:45,120 >> {'loss': 0.0018, 'learning_rate': 9.6085e-07, 'epoch': 3.55, 'throughput': 483.30}
[INFO|callbacks.py:310] 2024-07-16 10:18:58,287 >> {'loss': 0.0226, 'learning_rate': 9.2670e-07, 'epoch': 3.58, 'throughput': 483.32}
[INFO|callbacks.py:310] 2024-07-16 10:19:11,468 >> {'loss': 0.0008, 'learning_rate': 8.9303e-07, 'epoch': 3.60, 'throughput': 483.26}
[INFO|callbacks.py:310] 2024-07-16 10:19:24,632 >> {'loss': 0.0004, 'learning_rate': 8.5985e-07, 'epoch': 3.63, 'throughput': 483.13}
[INFO|callbacks.py:310] 2024-07-16 10:19:37,805 >> {'loss': 0.0008, 'learning_rate': 8.2717e-07, 'epoch': 3.65, 'throughput': 483.16}
[INFO|callbacks.py:310] 2024-07-16 10:19:50,961 >> {'loss': 0.0256, 'learning_rate': 7.9500e-07, 'epoch': 3.68, 'throughput': 483.12}
[INFO|callbacks.py:310] 2024-07-16 10:20:04,127 >> {'loss': 0.0005, 'learning_rate': 7.6335e-07, 'epoch': 3.70, 'throughput': 483.08}
[INFO|callbacks.py:310] 2024-07-16 10:20:17,283 >> {'loss': 0.0045, 'learning_rate': 7.3223e-07, 'epoch': 3.73, 'throughput': 483.15}
[INFO|callbacks.py:310] 2024-07-16 10:20:30,443 >> {'loss': 0.0005, 'learning_rate': 7.0165e-07, 'epoch': 3.76, 'throughput': 482.98}
[INFO|callbacks.py:310] 2024-07-16 10:20:43,619 >> {'loss': 0.0069, 'learning_rate': 6.7162e-07, 'epoch': 3.78, 'throughput': 483.23}
[INFO|callbacks.py:310] 2024-07-16 10:20:56,776 >> {'loss': 0.0150, 'learning_rate': 6.4214e-07, 'epoch': 3.81, 'throughput': 483.29}
[INFO|callbacks.py:310] 2024-07-16 10:21:09,946 >> {'loss': 0.0012, 'learning_rate': 6.1323e-07, 'epoch': 3.83, 'throughput': 483.32}
[INFO|callbacks.py:310] 2024-07-16 10:21:23,109 >> {'loss': 0.0095, 'learning_rate': 5.8489e-07, 'epoch': 3.86, 'throughput': 483.33}
[INFO|callbacks.py:310] 2024-07-16 10:21:36,282 >> {'loss': 0.0271, 'learning_rate': 5.5714e-07, 'epoch': 3.88, 'throughput': 483.39}
[INFO|callbacks.py:310] 2024-07-16 10:21:49,454 >> {'loss': 0.0201, 'learning_rate': 5.2997e-07, 'epoch': 3.91, 'throughput': 483.30}
[INFO|callbacks.py:310] 2024-07-16 10:22:02,608 >> {'loss': 0.0120, 'learning_rate': 5.0341e-07, 'epoch': 3.94, 'throughput': 483.25}
[INFO|callbacks.py:310] 2024-07-16 10:22:15,786 >> {'loss': 0.0230, 'learning_rate': 4.7746e-07, 'epoch': 3.96, 'throughput': 483.29}
[INFO|callbacks.py:310] 2024-07-16 10:22:28,957 >> {'loss': 0.0156, 'learning_rate': 4.5212e-07, 'epoch': 3.99, 'throughput': 483.22}
[INFO|callbacks.py:310] 2024-07-16 10:22:42,130 >> {'loss': 0.0009, 'learning_rate': 4.2741e-07, 'epoch': 4.01, 'throughput': 483.29}
[INFO|callbacks.py:310] 2024-07-16 10:22:55,293 >> {'loss': 0.0017, 'learning_rate': 4.0332e-07, 'epoch': 4.04, 'throughput': 483.27}
[INFO|callbacks.py:310] 2024-07-16 10:23:08,453 >> {'loss': 0.0015, 'learning_rate': 3.7988e-07, 'epoch': 4.06, 'throughput': 483.28}
[INFO|callbacks.py:310] 2024-07-16 10:23:21,618 >> {'loss': 0.0035, 'learning_rate': 3.5708e-07, 'epoch': 4.09, 'throughput': 483.18}
[INFO|callbacks.py:310] 2024-07-16 10:23:34,786 >> {'loss': 0.0016, 'learning_rate': 3.3494e-07, 'epoch': 4.12, 'throughput': 483.27}
[INFO|callbacks.py:310] 2024-07-16 10:23:47,940 >> {'loss': 0.0028, 'learning_rate': 3.1345e-07, 'epoch': 4.14, 'throughput': 483.30}
[INFO|callbacks.py:310] 2024-07-16 10:24:01,115 >> {'loss': 0.0006, 'learning_rate': 2.9263e-07, 'epoch': 4.17, 'throughput': 483.34}
[INFO|callbacks.py:310] 2024-07-16 10:24:14,287 >> {'loss': 0.0013, 'learning_rate': 2.7248e-07, 'epoch': 4.19, 'throughput': 483.39}
[INFO|callbacks.py:310] 2024-07-16 10:24:27,446 >> {'loss': 0.0006, 'learning_rate': 2.5301e-07, 'epoch': 4.22, 'throughput': 483.37}
[INFO|callbacks.py:310] 2024-07-16 10:24:40,617 >> {'loss': 0.0017, 'learning_rate': 2.3423e-07, 'epoch': 4.24, 'throughput': 483.25}
[INFO|callbacks.py:310] 2024-07-16 10:24:53,794 >> {'loss': 0.0004, 'learning_rate': 2.1614e-07, 'epoch': 4.27, 'throughput': 483.29}
[INFO|callbacks.py:310] 2024-07-16 10:25:06,960 >> {'loss': 0.0049, 'learning_rate': 1.9874e-07, 'epoch': 4.30, 'throughput': 483.28}
[INFO|callbacks.py:310] 2024-07-16 10:25:20,117 >> {'loss': 0.0071, 'learning_rate': 1.8204e-07, 'epoch': 4.32, 'throughput': 483.25}
[INFO|callbacks.py:310] 2024-07-16 10:25:33,302 >> {'loss': 0.0011, 'learning_rate': 1.6605e-07, 'epoch': 4.35, 'throughput': 483.17}
[INFO|callbacks.py:310] 2024-07-16 10:25:46,468 >> {'loss': 0.0004, 'learning_rate': 1.5077e-07, 'epoch': 4.37, 'throughput': 483.17}
[INFO|callbacks.py:310] 2024-07-16 10:25:59,629 >> {'loss': 0.0007, 'learning_rate': 1.3620e-07, 'epoch': 4.40, 'throughput': 483.20}
[INFO|callbacks.py:310] 2024-07-16 10:26:12,794 >> {'loss': 0.0017, 'learning_rate': 1.2236e-07, 'epoch': 4.42, 'throughput': 483.21}
[INFO|callbacks.py:310] 2024-07-16 10:26:25,961 >> {'loss': 0.0007, 'learning_rate': 1.0924e-07, 'epoch': 4.45, 'throughput': 483.29}
[INFO|callbacks.py:310] 2024-07-16 10:26:39,133 >> {'loss': 0.0003, 'learning_rate': 9.6846e-08, 'epoch': 4.48, 'throughput': 483.18}
[INFO|callbacks.py:310] 2024-07-16 10:26:52,302 >> {'loss': 0.0046, 'learning_rate': 8.5185e-08, 'epoch': 4.50, 'throughput': 483.13}
[INFO|callbacks.py:310] 2024-07-16 10:27:05,483 >> {'loss': 0.0038, 'learning_rate': 7.4261e-08, 'epoch': 4.53, 'throughput': 483.04}
[INFO|callbacks.py:310] 2024-07-16 10:27:18,649 >> {'loss': 0.0036, 'learning_rate': 6.4075e-08, 'epoch': 4.55, 'throughput': 483.09}
[INFO|callbacks.py:310] 2024-07-16 10:27:31,802 >> {'loss': 0.0056, 'learning_rate': 5.4631e-08, 'epoch': 4.58, 'throughput': 483.09}
[INFO|callbacks.py:310] 2024-07-16 10:27:44,968 >> {'loss': 0.0057, 'learning_rate': 4.5932e-08, 'epoch': 4.60, 'throughput': 483.12}
[INFO|callbacks.py:310] 2024-07-16 10:27:58,128 >> {'loss': 0.0020, 'learning_rate': 3.7981e-08, 'epoch': 4.63, 'throughput': 483.19}
[INFO|callbacks.py:310] 2024-07-16 10:28:11,283 >> {'loss': 0.0003, 'learning_rate': 3.0779e-08, 'epoch': 4.66, 'throughput': 483.12}
[INFO|callbacks.py:310] 2024-07-16 10:28:24,450 >> {'loss': 0.0002, 'learning_rate': 2.4330e-08, 'epoch': 4.68, 'throughput': 483.03}
[INFO|callbacks.py:310] 2024-07-16 10:28:37,620 >> {'loss': 0.0043, 'learning_rate': 1.8635e-08, 'epoch': 4.71, 'throughput': 482.89}
[INFO|callbacks.py:310] 2024-07-16 10:28:50,799 >> {'loss': 0.0002, 'learning_rate': 1.3695e-08, 'epoch': 4.73, 'throughput': 482.81}
[INFO|callbacks.py:310] 2024-07-16 10:29:03,962 >> {'loss': 0.0013, 'learning_rate': 9.5133e-09, 'epoch': 4.76, 'throughput': 482.82}
[INFO|callbacks.py:310] 2024-07-16 10:29:17,116 >> {'loss': 0.0023, 'learning_rate': 6.0899e-09, 'epoch': 4.78, 'throughput': 482.85}
[INFO|callbacks.py:310] 2024-07-16 10:29:30,281 >> {'loss': 0.0002, 'learning_rate': 3.4262e-09, 'epoch': 4.81, 'throughput': 482.98}
[INFO|callbacks.py:310] 2024-07-16 10:29:43,438 >> {'loss': 0.0015, 'learning_rate': 1.5229e-09, 'epoch': 4.84, 'throughput': 482.95}
[INFO|callbacks.py:310] 2024-07-16 10:29:56,602 >> {'loss': 0.0002, 'learning_rate': 3.8076e-10, 'epoch': 4.86, 'throughput': 482.96}
[INFO|callbacks.py:310] 2024-07-16 10:30:09,755 >> {'loss': 0.0028, 'learning_rate': 0.0000e+00, 'epoch': 4.89, 'throughput': 482.97}
[INFO|trainer.py:3478] 2024-07-16 10:30:17,367 >> Saving model checkpoint to saves/LLaMA3-8B-Chat/full/train_2024-07-16-09-46-28_llama3/checkpoint-190
[INFO|configuration_utils.py:472] 2024-07-16 10:30:17,370 >> Configuration saved in saves/LLaMA3-8B-Chat/full/train_2024-07-16-09-46-28_llama3/checkpoint-190/config.json
[INFO|configuration_utils.py:769] 2024-07-16 10:30:17,371 >> Configuration saved in saves/LLaMA3-8B-Chat/full/train_2024-07-16-09-46-28_llama3/checkpoint-190/generation_config.json
[INFO|modeling_utils.py:2698] 2024-07-16 10:30:33,564 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at saves/LLaMA3-8B-Chat/full/train_2024-07-16-09-46-28_llama3/checkpoint-190/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2574] 2024-07-16 10:30:33,568 >> tokenizer config file saved in saves/LLaMA3-8B-Chat/full/train_2024-07-16-09-46-28_llama3/checkpoint-190/tokenizer_config.json
[INFO|tokenization_utils_base.py:2583] 2024-07-16 10:30:33,568 >> Special tokens file saved in saves/LLaMA3-8B-Chat/full/train_2024-07-16-09-46-28_llama3/checkpoint-190/special_tokens_map.json
[INFO|trainer.py:2383] 2024-07-16 10:31:10,372 >> Training completed. Do not forget to share your model on huggingface.co/models =)
[INFO|trainer.py:3478] 2024-07-16 10:31:17,984 >> Saving model checkpoint to saves/LLaMA3-8B-Chat/full/train_2024-07-16-09-46-28_llama3
[INFO|configuration_utils.py:472] 2024-07-16 10:31:17,987 >> Configuration saved in saves/LLaMA3-8B-Chat/full/train_2024-07-16-09-46-28_llama3/config.json
[INFO|configuration_utils.py:769] 2024-07-16 10:31:17,988 >> Configuration saved in saves/LLaMA3-8B-Chat/full/train_2024-07-16-09-46-28_llama3/generation_config.json
[INFO|modeling_utils.py:2698] 2024-07-16 10:31:35,440 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at saves/LLaMA3-8B-Chat/full/train_2024-07-16-09-46-28_llama3/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2574] 2024-07-16 10:31:35,443 >> tokenizer config file saved in saves/LLaMA3-8B-Chat/full/train_2024-07-16-09-46-28_llama3/tokenizer_config.json
[INFO|tokenization_utils_base.py:2583] 2024-07-16 10:31:35,444 >> Special tokens file saved in saves/LLaMA3-8B-Chat/full/train_2024-07-16-09-46-28_llama3/special_tokens_map.json
[WARNING|ploting.py:89] 2024-07-16 10:31:36,770 >> No metric eval_loss to plot.
[WARNING|ploting.py:89] 2024-07-16 10:31:36,770 >> No metric eval_accuracy to plot.
[INFO|modelcard.py:449] 2024-07-16 10:31:36,770 >> Dropping the following result as it does not have all the necessary fields: {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
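As a sanity check on the sharding messages above, the bf16 checkpoint of 8,030,261,248 parameters is roughly 16 GB, which splits into 4 files under the 5 GB per-shard cap quoted in the log. Plain arithmetic:

params = 8_030_261_248
bytes_per_param = 2                        # torch.bfloat16
total_bytes = params * bytes_per_param     # about 16.06e9 bytes
shard_cap = 5 * 10**9                      # the "maximum size per checkpoint (5GB)" from the log
num_shards = -(-total_bytes // shard_cap)  # ceiling division -> 4 shards
print(total_bytes / 1e9, num_shards)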