update

Files changed (10)

README.md +85 -0
config.json +2 -1
configuration_mixtral.py +8 -2
model-00001-of-00004.safetensors +1 -1
model-00002-of-00004.safetensors +1 -1
model-00003-of-00004.safetensors +1 -1
model-00004-of-00004.safetensors +1 -1
modeling_mixtral.py +888 -52
trainer_state.json +2027 -907
training_args.bin +2 -2

README.md CHANGED

@@ -1,3 +1,88 @@
 ---
 license: apache-2.0
 ---

 ---
 license: apache-2.0
+language:
+- en
+tags:
+- MoE
 ---
+# LLaMA-MoE-v2-3.8B (2/8) SFT
+[[💻 Code]](https://github.com/OpenSparseLLMs/LLaMA-MoE-v2) | [[📃 Technical Report]](https://arxiv.org/pdf/2411.15708)
+LLaMA-MoE-v2 is a series of open-source Mixture-of-Experts (MoE) models based on [LLaMA3](https://github.com/facebookresearch/llama).
+We build LLaMA-MoE-v2 in two steps:
+1. **Partition** LLaMA's FFN layers or Attention layers into sparse experts and insert a top-K gate for each layer of experts.
+2. **Fine-tune** the constructed MoE models on open-source data with two-stage supervised fine-tuning (SFT).
+| Model                     | \#Activated Experts | \#Experts | \#Activated Params |                      SFT Model                                             |
+| :-----------------------: | :-----------------: | :-------: | :----------------: | :------------------------------------------------------------------------: |
+| **LLaMA-MLP-MoE (2/8)**   |          2          |     8     |        3.8B        | [🤗 SFT](https://huggingface.co/llama-moe/LLaMA-MoE-v2-3_8B-2_8-sft)       |
+| **LLaMA-MLP-MoE (1+1/7)** |          2          |     8     |        3.8B        | [🤗 SFT](https://huggingface.co/llama-moe/LLaMA-MoE-v2-3_8B-residual-sft)  |
+## 🚀 QuickStart
+```python
+# python>=3.10
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+model_dir = "llama-moe/LLaMA-MoE-v2-3_8B-2_8-sft"
+tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
+model.eval()
+model.cuda()
+input_text = "Could you recommend me some mystery novels?"
+input_text = f"<|start_header_id|>user<|end_header_id|>\n\n{input_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
+inputs = tokenizer(input_text, return_tensors="pt")
+input_ids = inputs["input_ids"].cuda()
+pred = model.generate(input_ids, max_length=200, temperature=1.0, do_sample=True, use_cache=True)
+print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
+"""
+I'd be delighted to recommend some mystery novels to you! Here are a few suggestions across various sub-genres:
+**Classic Whodunit**
+1. "And Then There Were None" by Agatha Christie - A timeless tale of ten strangers who are invited to an isolated island, only to be killed off one by one.
+2. "The Murder on the Orient Express" by Agatha Christie - A classic whodunit set on a luxurious train traveling from Istanbul to Paris, where a famous author goes missing.
+3. "The Devil in the White City" by Erik Larson - A non-fiction book that combines historical events with a mystery, exploring the 1893 World's Columbian Exposition in Chicago and the serial killer H.H. Holmes.
+**Modern Whodunits**
+1. "Gone Girl" by Gillian Flynn - A twisty, psychological thriller about a couple whose seemingly perfect ...
+"""
+```
+## 📊 Performance
+| Model | #Training Tokens | MMLU(5) | GSM8k(8) | HumanEval(pass@10) | IFEval | BoolQ(32) | SciQ | PIQA | ARC-c(25) | TruthfulQA | HellaSwag(10) |
+|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+| [LLaMA3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | 15T | 67.2 | 76.5 | 71.4 | 76.5 | 83.0 | 93.2 | 78.5 | 61.9 | 51.7 | 78.8 |
+| [INCITE-3B](https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-3B-v1) | 1T | 25.1 | 2.1 | 6.92 | 30.1 | 66.5 | 94.7 | 74.4 | 40.2 | 36.4 | 65.6 |
+| [Sheared-LLaMA-2.7B](https://huggingface.co/princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT) | 50B | 28.2 | 1.9 | 3.2 | 28.8 | 67.6 | 75.8 | 41.1 | 47.6 | 71.2 | 39.0 |
+| [Gemma-2-2b](https://huggingface.co/google/gemma-2-2b-it) | 2T | 53.0 | 26.3 | 46.1 | 34.9 | 72.3 | 75.8 | 67.5 | 52.6 | 50.8 | 69.0 |
+| [Salamandra-2b](https://huggingface.co/BSC-LT/salamandra-2b-instruct) | 7.8T | 25.1 | 1.90 | 5.82 | 27.7 | 68.0 | 89.8 | 74.7 | 46.3 | 43.4 | 62.3 |
+| [SmolLM2-1.7B](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct) | 11T | 50.4 | 38.5 | 39.1 | 29.0 | 68.2 | 84.3 | 76.0 | 53.2 | 39.9 | 72.6 |
+| [OpenMoE-3B-9B](https://huggingface.co/OrionZheng/openmoe-8b-chat) | 1T | 26.5 | 1.36 | 1.01 | 31.2 | 61.7 | 68.4 | 65.7 | 33.3 | 40.5 | 56.5 |
+| [LLaMA-MoE-3B-7B](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft) | 200B | 28.2 | 4.62 | 12.0 | 28.1 | 68.1 | 88.8 | 77.9 | 44.0 | 33.3 | 73.2 |
+| [OLMoE-1B-7B](https://huggingface.co/allenai/OLMoE-1B-7B-0924-SFT) | 1T | 53.8 | 40.9 | 40.5 | 35.5 | 80.9 | 94.9 | 80.1 | 55.6 | 43.3 | 79.6 |
+| **MLP-MoE (8top2)** | **7B** | 40.6 | 53.1 | 53.5 | 32.7 | 74.6 | 90.6 | 69.3 | 42.8 | 45.6 | 59.0 |
+| **MLP-MoE (8top2)** | **8.4B** | 41.0 | **59.6** | **57.1** | 31.7 | 74.5 | 90.2 | 69.5 | 43.3 | 46.9 | 58.1 |
+| **MLP-MoE (1+7top1)** | **7B** | 42.7 | 55.0 | 51.2 | **36.0** | 76.9 | 88.8 | 67.9 | 40.2 | 46.9 | 53.7 |
+## 📃 Citation
+```bibtex
+@misc{llama-moe-v2,
+  title={LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training},
+  author={Xiaoye Qu and Daize Dong and Xuyang Hu and Tong Zhu and Weigao Sun and Yu Cheng},
+  year={2024},
+  month={Nov},
+  url={https://arxiv.org/abs/2411.15708}
+}
+```
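
Reviewer note on the README's step 1 ("insert a top-K gate for each layer of experts"): the routing idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code in `modeling_mixtral.py`. The names `top_k_gate` and `moe_forward` are hypothetical, and renormalizing the selected weights so they sum to 1 is an assumption that some routers do not make.

```python
import math


def top_k_gate(logits, k):
    """Return (expert_indices, weights) for one token's router logits."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Numerically stable softmax over all experts.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts (ties keep the lower index first).
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalize so the selected weights sum to 1.
    mass = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / mass for i in chosen]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the k selected experts."""
    chosen, weights = top_k_gate(router_logits, k)
    return sum(w * experts[i](x) for i, w in zip(chosen, weights))


# Toy usage: 8 scalar "experts", 2 active per token, as in the (2/8) setup.
toy_experts = [lambda x, scale=i: scale * x for i in range(8)]
toy_logits = [0.0, 3.0, 1.0, 2.0, -1.0, 0.5, 0.2, -0.3]
print(top_k_gate(toy_logits, 2))
print(moe_forward(1.0, toy_experts, toy_logits, k=2))
```

Only the selected experts are evaluated, which is why the number of activated parameters (3.8B in the table above) is smaller than the total parameter count.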

config.json CHANGED

@@ -1,11 +1,12 @@
 {
-  "_name_or_path": "/mnt/petrelfs/huxuyang/LLaMA-MoE-v2/outputs/v2_mixtral/moe-res-droppad-nosys-all/3653852/checkpoint-3600",
   "add_rescale_bias": false,
   "architectures": [
     "MixtralForCausalLM"
   ],
   "attention_bias": false,
   "attention_dropout": 0.0,
   "auto_map": {
     "AutoConfig": "configuration_mixtral.MixtralConfig",
     "AutoModel": "modeling_mixtral.MixtralModel",

 {
+  "_name_or_path": "/mnt/petrelfs/quxiaoye/models/sft-v2/moe8top2_onestage",
   "add_rescale_bias": false,
   "architectures": [
     "MixtralForCausalLM"
   ],
   "attention_bias": false,
   "attention_dropout": 0.0,
+  "attn_experts": null,
   "auto_map": {
     "AutoConfig": "configuration_mixtral.MixtralConfig",
     "AutoModel": "modeling_mixtral.MixtralModel",

configuration_mixtral.py CHANGED

@@ -170,6 +170,7 @@ class MixtralConfig(PretrainedConfig):
         num_moe_contract_layers: int = 0,  # 🔍 the number of layers that are not converted into MoE at each side of the model
         use_attn_moe: bool = False,  # 🔍
         top_k_attn: int = None,  # 🔍
         scale_factor_attn: float = None,  # 🔍
         use_layer_wise_balance: bool = False,  # ✨ whether to fix the balance loss bug for Mixtral
         add_rescale_bias: bool = False,  # 🔍 whether to add bias to the AttentionMoE `o_proj` & MoE `down_proj` for distribution alignment
@@ -208,6 +209,7 @@ class MixtralConfig(PretrainedConfig):
         self.use_attn_moe = use_attn_moe
         self.top_k_attn = top_k_attn
         self.scale_factor_attn = scale_factor_attn
         # ✨ For balance loss bugfix
         self.use_layer_wise_balance = use_layer_wise_balance
@@ -232,11 +234,15 @@ class MixtralConfig(PretrainedConfig):
         if hasattr(self, "_attn_implementation_internal"):
             if self._attn_implementation_internal is None:
                 # `config.attn_implementation` should never be None, for backward compatibility.
-                return "eager"
             else:
                 return self._attn_implementation_internal
         else:
-            return "eager"
     @_attn_implementation.setter
     def _attn_implementation(self, value):

         num_moe_contract_layers: int = 0,  # 🔍 the number of layers that are not converted into MoE at each side of the model
         use_attn_moe: bool = False,  # 🔍
         top_k_attn: int = None,  # 🔍
+        attn_experts: int = None,
         scale_factor_attn: float = None,  # 🔍
         use_layer_wise_balance: bool = False,  # ✨ whether to fix the balance loss bug for Mixtral
         add_rescale_bias: bool = False,  # 🔍 whether to add bias to the AttentionMoE `o_proj` & MoE `down_proj` for distribution alignment
         self.use_attn_moe = use_attn_moe
         self.top_k_attn = top_k_attn
         self.scale_factor_attn = scale_factor_attn
+        self.attn_experts = attn_experts
         # ✨ For balance loss bugfix
         self.use_layer_wise_balance = use_layer_wise_balance
         if hasattr(self, "_attn_implementation_internal"):
             if self._attn_implementation_internal is None:
                 # `config.attn_implementation` should never be None, for backward compatibility.
+                return "flash_attention_2"
+                # return "eager"
             else:
                 return self._attn_implementation_internal
         else:
+            return "flash_attention_2"
+            # return "eager"
     @_attn_implementation.setter
     def _attn_implementation(self, value):
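
Reviewer note: this hunk changes the fallback attention implementation from `"eager"` to `"flash_attention_2"`, which presumably fails on machines where the `flash_attn` package is not installed. A guard along the following lines could keep the old behaviour as a fallback. It is a sketch only: `pick_attn_implementation` is a hypothetical helper, not part of this repository or of `transformers`, and checking for the `flash_attn` package name is an assumption about how availability would be detected.

```python
import importlib.util


def pick_attn_implementation(preferred="flash_attention_2", package="flash_attn"):
    """Return `preferred` if its backing package is importable, else "eager".

    Only the flash-attention choice is guarded; any other value is passed through.
    """
    if preferred == "flash_attention_2" and importlib.util.find_spec(package) is None:
        return "eager"
    return preferred


# Prints "flash_attention_2" when flash_attn is installed, "eager" otherwise.
print(pick_attn_implementation())
```

Recent versions of `transformers` also accept an `attn_implementation` argument in `from_pretrained`; whether this repository's custom config honours it has not been verified here.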

model-00001-of-00004.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e5bc31ca5dbdc23c38713b734d2654cfa413133981c35bdb633ea0d310f90cb8
 size 4977314560

 version https://git-lfs.github.com/spec/v1
+oid sha256:d5c37f87fd8cb399be7701cafd53561b462b68451df8888e37e27a87afd9cd80
 size 4977314560

model-00002-of-00004.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:762ce2834feae9ba9f238e4d927104291da3d73198328f31ebd722c6429cae17
 size 4985941976

 version https://git-lfs.github.com/spec/v1
+oid sha256:0f24b0cd37967f622d52fb345d76dcd0f26d41959a70d1bf940b9ca28f9f2bef
 size 4985941976

model-00003-of-00004.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:504bebbc7dabc69a76ef584204bdcbbcef1f31b9e61e39aa5c96690aa9461522
 size 4990070968

 version https://git-lfs.github.com/spec/v1
+oid sha256:501a1bdc13d200b85e7f9be67535da141dd39092f31dd91d90f220469d67d395
 size 4990070968

model-00004-of-00004.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:59ac862ad596b661d4101aba2d2555e37c3ab8617e4bb4107737cfe63e7aca40
 size 1109418960

 version https://git-lfs.github.com/spec/v1
+oid sha256:086b141ea5a5163e6fcbaf0db9fa5439476fa4eac16c8b1cb0f4de33d8ceebb7
 size 1109418960
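
Reviewer note: in all four shard hunks only the `oid sha256:` line changes while `size` stays the same, so file size alone cannot tell the old weights from the new ones. A downloaded shard can be checked against its Git LFS pointer roughly as follows. This is a sketch using only the standard library; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the pointer layout is taken from the three-line form shown in the hunks above.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer ("version ...", "oid algo:hex", "size N")."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Stream the file at `path` and compare its size and hash to the pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


# Pointer for the updated fourth shard, copied from the hunk above.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:086b141ea5a5163e6fcbaf0db9fa5439476fa4eac16c8b1cb0f4de33d8ceebb7\n"
    "size 1109418960\n"
)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("model-00004-of-00004.safetensors", pointer)` on a local copy would then report whether that copy is the updated shard.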

modeling_mixtral.py CHANGED

@@ -49,8 +49,6 @@ from transformers.utils.import_utils import (
     is_torchdynamo_compiling,
 )
-from smoe.utils.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask
 from .configuration_mixtral import MixtralConfig
 logger = logging.get_logger(__name__)
@@ -123,6 +121,338 @@ def is_flash_attn_available():
     return is_flash_attn_2_available()
 @dataclass
 class MoeCausalLMOutputWithPast(ModelOutput):
     """
@@ -270,7 +600,7 @@ def load_balancing_loss_func(
     Returns:
         The auxiliary loss.
     """
-    if gate_logits is None:
         return 0
     # ✨ Here is the fix for balance loss in Mixtral.
@@ -812,16 +1142,20 @@ class MixtralAttentionMoE(MixtralAttention):
             )
         # 🔍
-        self.gate = nn.Linear(self.hidden_size, self.num_key_value_heads, bias=False)
         self.softmax = nn.Softmax(dim=-1)
         self.top_k_attn = config.top_k_attn
         self.scale_factor_attn = config.scale_factor_attn
         # 🔍
-        self.q_proj = nn.ModuleList([nn.Linear(self.hidden_size, self.num_key_value_groups * self.head_dim, bias=False) for _ in range(self.num_key_value_heads)])
-        self.k_proj = nn.ModuleList([nn.Linear(self.hidden_size, self.head_dim, bias=False) for _ in range(self.num_key_value_heads)])
-        self.v_proj = nn.ModuleList([nn.Linear(self.hidden_size, self.head_dim, bias=False) for _ in range(self.num_key_value_heads)])
-        self.o_proj = nn.ModuleList([nn.Linear(self.num_key_value_groups * self.head_dim, self.hidden_size, bias=config.add_rescale_bias) for _ in range(self.num_key_value_heads)])  # 🔍 (may add bias for rescaling)
         self.rotary_emb = MixtralRotaryEmbedding(
             self.head_dim,
@@ -847,6 +1181,7 @@ class MixtralAttentionMoE(MixtralAttention):
             raise TypeError(
                 "`past_key_value` must be a `MoECache` instance for attention MoE!"
             )
         device = hidden_states.device
         dtype = hidden_states.dtype
         bsz, q_len, hidden_dim = hidden_states.size()
@@ -865,12 +1200,12 @@ class MixtralAttentionMoE(MixtralAttention):
         # One hot encode the selected experts to create an expert mask
         # this will be used to easily index which expert is going to be solicited
-        expert_mask = torch.nn.functional.one_hot(selected_experts, num_classes=self.num_key_value_heads)  # (bsz * q_len, top_k_attn, num_key_value_heads)
         expert_mask = expert_mask.permute(2, 1, 0)  # (num_key_value_heads, top_k_attn, bsz * q_len)
         # Loop over all available experts in the model and perform the computation on each expert
         all_attn_weights = [] if output_attentions else None
-        for expert_idx in range(self.num_key_value_heads):
             # expert_mask[expert_idx]: (top_k_attn, bsz * q_len)
             # idx: the topk position. (selected_num)
             # top_x: token index. (selected_num)
@@ -911,7 +1246,7 @@ class MixtralAttentionMoE(MixtralAttention):
             key_states = self.k_proj[expert_idx](current_state)  # 🔍 specify expert
             value_states = self.v_proj[expert_idx](current_state)  # 🔍 specify expert
-            query_states = query_states.view(bsz, this_q_len, self.num_key_value_groups, self.head_dim).transpose(1, 2)  # 🔍 q_len -> this_q_len, num_heads -> num_key_value_groups
             key_states = key_states.view(bsz, this_q_len, 1, self.head_dim).transpose(1, 2)  # 🔍 q_len -> this_q_len, num_key_value_heads -> 1
             value_states = value_states.view(bsz, this_q_len, 1, self.head_dim).transpose(1, 2)  # 🔍 q_len -> this_q_len, num_key_value_heads -> 1
@@ -946,8 +1281,8 @@ class MixtralAttentionMoE(MixtralAttention):
             attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)  # softmax temperature
-            if attn_weights.size() != (bsz, self.num_key_value_groups, this_q_len, kv_seq_len):  # 🔍 q_len -> this_q_len, num_heads -> num_key_value_groups
-                raise ValueError(f"Attention weights should be of size {(bsz, self.num_key_value_groups, this_q_len, kv_seq_len)}, but is {attn_weights.size()}")
             # 🔍 create `current_attention_mask` with reduced `seq_len`
             # Notice that the `attention_mask` is passed intact during both training & generation, so we need to adjust the `top_x` by `past_key_values_length`.
@@ -961,11 +1296,12 @@ class MixtralAttentionMoE(MixtralAttention):
                     temp_attention_mask = attention_mask[:, previous_seen_tokens_total:].flatten()  # select along dimension 1 so that we get tokens in this iteration
                 else:
                     temp_attention_mask = attention_mask.flatten()  # flatten the dim
-                current_attention_mask[current_batch_ids, current_seq_ids] = temp_attention_mask[top_x]  # assign masks sparsely
             else:
                 current_attention_mask[current_batch_ids, current_seq_ids] = True  # assign masks sparsely
             if past_key_value is not None:  # 🔍 we need to update with cached attention mask
                 current_attention_mask = past_key_value.update_attention_mask(current_attention_mask, self.layer_idx, expert_idx)
@@ -983,17 +1319,17 @@ class MixtralAttentionMoE(MixtralAttention):
                 raise ValueError(f"Attention mask should be of size {(bsz, 1, this_q_len, kv_seq_len)}, but is {current_attention_mask.size()}")
             attn_weights = attn_weights + current_attention_mask  # 🔍
             # upcast attention to fp32
             attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
             attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
             attn_output = torch.matmul(attn_weights, value_states)
-            if attn_output.size() != (bsz, self.num_key_value_groups, this_q_len, self.head_dim):  # 🔍 q_len -> this_q_len, num_heads -> num_key_value_groups
-                raise ValueError(f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.size()}")
             attn_output = attn_output.transpose(1, 2).contiguous()
-            attn_output = attn_output.reshape(bsz, this_q_len, self.num_key_value_groups * self.head_dim)  # 🔍 q_len -> this_q_len, hidden_size -> num_key_value_groups * head_dim
             attn_output = self.o_proj[expert_idx](attn_output)
             # ---------------------------------------------- #
@@ -1026,27 +1362,16 @@ class MixtralAttentionMoE(MixtralAttention):
         # init
         attention_moe = MixtralAttentionMoE(config, layer_idx)
         # copy weights
-        num_key_value_groups = attention_moe.num_key_value_groups
         head_dim = attention_moe.head_dim
-        # attention
-        # q_proj: (self.hidden_size, self.num_heads * self.head_dim)
-        # k_proj: (self.hidden_size, self.num_key_value_heads * self.head_dim)
-        # v_proj: (self.hidden_size, self.num_key_value_heads * self.head_dim)
-        # o_proj: (self.num_heads * self.head_dim, self.hidden_size)
-        # attention_moe
-        # q_proj: (self.hidden_size, self.num_key_value_groups * self.head_dim)
-        # k_proj: (self.hidden_size, self.head_dim)
-        # v_proj: (self.hidden_size, self.head_dim)
-        # o_proj: (self.num_key_value_groups * self.head_dim, self.hidden_size)
-        for i in range(config.num_key_value_heads):
             indices_q_o = [j for j in range(head_dim * num_key_value_groups * i, head_dim * num_key_value_groups * (i + 1))]
-            indices_k_v = [j for j in range(head_dim * i, head_dim * (i + 1))]
-            # print(i, "indices_q_o", indices_q_o)
             # print(i, "indices_k_v", indices_k_v)
             attention_moe.q_proj[i].weight.data = attention.q_proj.weight.data[indices_q_o].clone()
@@ -1204,6 +1529,7 @@ class MixtralFlashAttention2(MixtralAttention):
         key_states = key_states.transpose(1, 2)
         value_states = value_states.transpose(1, 2)
         attn_output = self._flash_attention_forward(
             query_states,
             key_states,
@@ -1341,7 +1667,6 @@ class MixtralFlashAttention2(MixtralAttention):
         self, query_layer, key_layer, value_layer, attention_mask, query_length
     ):
         batch_size, kv_seq_len, num_heads, head_dim = key_layer.shape
         # On the first iteration we need to properly re-create the padding mask
         # by slicing it on the proper place
         if kv_seq_len != attention_mask.shape[-1]:
@@ -1389,6 +1714,517 @@ class MixtralFlashAttention2(MixtralAttention):
         )
 class MixtralBLockSparseTop2MLP(nn.Module):
     def __init__(self, config: MixtralConfig, ffn_dim, add_rescale_bias=False):  # 🔍
         super().__init__()
@@ -1419,7 +2255,7 @@ MISTRAL_ATTENTION_CLASSES = {
 # 🔍
 MISTRAL_ATTENTION_MOE_CLASSES = {
     "eager": MixtralAttentionMoE,
-    "flash_attention_2": None,
 }
@@ -1698,13 +2534,14 @@ class MixtralDecoderLayer(nn.Module):
         )
         self.use_attn_moe = config.use_attn_moe
         if self.is_moe:
-            attn_class = (
-                MISTRAL_ATTENTION_MOE_CLASSES[config._attn_implementation]
-                if self.use_attn_moe
-                else MISTRAL_ATTENTION_CLASSES[config._attn_implementation]
-            )
-            self.self_attn = attn_class(config, layer_idx)
             self.block_sparse_moe = MixtralSparseMoeBlock(config)
             self.mlp_residual = (
                 MixtralBLockSparseTop2MLP(config, config.intermediate_size_residual)
@@ -1713,8 +2550,6 @@ class MixtralDecoderLayer(nn.Module):
             )
         else:
-            attn_class = MISTRAL_ATTENTION_CLASSES[config._attn_implementation]
-            self.self_attn = attn_class(config, layer_idx)
             self.block_sparse_moe = MixtralBLockSparseTop2MLP(
                 config, config.intermediate_size * config.num_local_experts
             )
@@ -1766,7 +2601,7 @@ class MixtralDecoderLayer(nn.Module):
         hidden_states = self.input_layernorm(hidden_states)
         # 🔍 Self Attention
-        if self.is_moe and self.use_attn_moe:
             (
                 hidden_states,
                 self_attn_weights,
@@ -1795,18 +2630,18 @@ class MixtralDecoderLayer(nn.Module):
         # Fully Connected
         residual = hidden_states
-        hidden_states = self.post_attention_layernorm(hidden_states)
         # 🔍
         if self.is_moe:
-            hidden_states, router_logits = self.block_sparse_moe(hidden_states)
         else:
-            hidden_states = self.block_sparse_moe(hidden_states)
             router_logits = None
         if self.mlp_residual is not None:
-            # hidden_states += self.mlp_residual(hidden_states)  # 🔍
-            hidden_states = self.mlp_residual(hidden_states) + hidden_states  # 🔍
         hidden_states = residual + hidden_states
         outputs = (hidden_states,)
@@ -2223,7 +3058,7 @@ class MixtralForCausalLM(MixtralPreTrainedModel):
             if len(valid_attn_router_logits) > 0:  # exist logits that is not None
                 attn_aux_loss = load_balancing_loss_func(
                     valid_attn_router_logits,
-                    self.config.num_key_value_heads,
                     self.config.top_k_attn,
                     use_layer_wise_balance=self.config.use_layer_wise_balance,  # ✨
                 )
@@ -2632,7 +3467,8 @@ class MixtralForCausalLM(MixtralPreTrainedModel):
             if past is None:
                 if self.config.use_attn_moe:  # 🔍
                     model_kwargs["past_key_values"] = MoECache(
-                        self.config.num_key_value_heads
                     )
                 else:  # 🔍
                     model_kwargs["past_key_values"] = DynamicCache()

     is_torchdynamo_compiling,
 )
 from .configuration_mixtral import MixtralConfig
 logger = logging.get_logger(__name__)
     return is_flash_attn_2_available()
+@dataclass
+class AttentionMaskConverter:
+    """
+    A utility attention mask class that allows one to:
+        - Create a causal 4d mask
+        - Create a causal 4d mask with sliding window
+        - Convert a 2d attention mask (batch_size, query_length) to a 4d attention mask (batch_size, 1, query_length,
+          key_value_length) that can be multiplied with attention scores
+    Examples:
+    ```python
+    >>> import torch
+    >>> from transformers.modeling_attn_mask_utils import AttentionMaskConverter
+    >>> converter = AttentionMaskConverter(True)
+    >>> converter.to_4d(torch.tensor([[0, 0, 0, 1, 1]]), 5, key_value_length=5, dtype=torch.float32)
+    tensor([[[[-3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38],
+            [-3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38],
+            [-3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38],
+            [-3.4028e+38, -3.4028e+38, -3.4028e+38,  0.0000e+00, -3.4028e+38],
+            [-3.4028e+38, -3.4028e+38, -3.4028e+38,  0.0000e+00,  0.0000e+00]]]])
+    ```
+    Parameters:
+        is_causal (`bool`):
+            Whether the attention mask should be a uni-directional (causal) or bi-directional mask.
+        sliding_window (`int`, *optional*):
+            Optionally, the sliding window masks can be created if `sliding_window` is defined to a positive integer.
+    """
+    is_causal: bool
+    sliding_window: int
+    def __init__(self, is_causal: bool, sliding_window: Optional[int] = None):
+        self.is_causal = is_causal
+        self.sliding_window = sliding_window
+        if self.sliding_window is not None and self.sliding_window <= 0:
+            raise ValueError(
+                f"Make sure that when passing `sliding_window` that its value is a strictly positive integer, not `{self.sliding_window}`"
+            )
+    def to_causal_4d(
+        self,
+        batch_size: int,
+        query_length: int,
+        key_value_length: int,
+        dtype: torch.dtype,
+        device: Union[torch.device, "str"] = "cpu",
+    ) -> Optional[torch.Tensor]:
+        """
+        Creates a causal 4D mask of (bsz, head_dim=1, query_length, key_value_length) shape and adds large negative
+        bias to upper right hand triangular matrix (causal mask).
+        """
+        if not self.is_causal:
+            raise ValueError(
+                f"Please use `to_causal_4d` only if {self.__class__} has `is_causal` set to True."
+            )
+        # If shape is not cached, create a new causal mask and cache it
+        input_shape = (batch_size, query_length)
+        past_key_values_length = key_value_length - query_length
+        # create causal mask
+        # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
+        causal_4d_mask = None
+        if input_shape[-1] > 1 or self.sliding_window is not None:
+            causal_4d_mask = self._make_causal_mask(
+                input_shape,
+                dtype,
+                device=device,
+                past_key_values_length=past_key_values_length,
+                sliding_window=self.sliding_window,
+            )
+        return causal_4d_mask
+    def to_4d(
+        self,
+        attention_mask_2d: torch.Tensor,
+        query_length: int,
+        dtype: torch.dtype,
+        key_value_length: Optional[int] = None,
+    ) -> torch.Tensor:
+        """
+        Converts 2D attention mask to 4D attention mask by expanding mask to (bsz, head_dim=1, query_length,
+        key_value_length) shape and by adding a large negative bias to not-attended positions. If attention_mask is
+        causal, a causal mask will be added.
+        """
+        input_shape = (attention_mask_2d.shape[0], query_length)
+        # create causal mask
+        # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
+        causal_4d_mask = None
+        if (input_shape[-1] > 1 or self.sliding_window is not None) and self.is_causal:
+            if key_value_length is None:
+                raise ValueError(
+                    "This attention mask converter is causal. Make sure to pass `key_value_length` to correctly create a causal mask."
+                )
+            past_key_values_length = key_value_length - query_length
+            causal_4d_mask = self._make_causal_mask(
+                input_shape,
+                dtype,
+                device=attention_mask_2d.device,
+                past_key_values_length=past_key_values_length,
+                sliding_window=self.sliding_window,
+            )
+        elif self.sliding_window is not None:
+            raise NotImplementedError(
+                "Sliding window is currently only implemented for causal masking"
+            )
+        # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
+        expanded_attn_mask = self._expand_mask(
+            attention_mask_2d, dtype, tgt_len=input_shape[-1]
+        ).to(attention_mask_2d.device)
+        if causal_4d_mask is not None:
+            expanded_attn_mask = causal_4d_mask.masked_fill(
+                expanded_attn_mask.bool(), torch.finfo(dtype).min
+            )
+        # expanded_attn_mask + causal_4d_mask can cause some overflow
+        expanded_4d_mask = expanded_attn_mask
+        return expanded_4d_mask
+    @staticmethod
+    def _make_causal_mask(
+        input_ids_shape: torch.Size,
+        dtype: torch.dtype,
+        device: torch.device,
+        past_key_values_length: int = 0,
+        sliding_window: Optional[int] = None,
+    ):
+        """
+        Make the causal mask used for uni-directional (causal) self-attention.
+        """
+        bsz, tgt_len = input_ids_shape
+        mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min, device=device)
+        mask_cond = torch.arange(mask.size(-1), device=device)
+        mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
+        mask = mask.to(dtype)
+        if past_key_values_length > 0:
+            mask = torch.cat(
+                [
+                    torch.zeros(
+                        tgt_len, past_key_values_length, dtype=dtype, device=device
+                    ),
+                    mask,
+                ],
+                dim=-1,
+            )
+        # add lower triangular sliding window mask if necessary
+        if sliding_window is not None:
+            diagonal = past_key_values_length - sliding_window + 1
+            context_mask = 1 - torch.triu(
+                torch.ones_like(mask, dtype=torch.int), diagonal=diagonal
+            )
+            mask.masked_fill_(context_mask.bool(), torch.finfo(dtype).min)
+        return mask[None, None, :, :].expand(
+            bsz, 1, tgt_len, tgt_len + past_key_values_length
+        )
+    @staticmethod
+    def _expand_mask(
+        mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None
+    ):
+        """
+        Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
+        """
+        bsz, src_len = mask.size()
+        tgt_len = tgt_len if tgt_len is not None else src_len
+        expanded_mask = (
+            mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
+        )
+        inverted_mask = 1.0 - expanded_mask
+        return inverted_mask.masked_fill(
+            inverted_mask.to(torch.bool), torch.finfo(dtype).min
+        )
+    @staticmethod
+    def _unmask_unattended(
+        expanded_mask: torch.Tensor,
+        attention_mask: torch.Tensor,
+        unmasked_value: Union[bool, float],
+    ):
+        # fmt: off
+        """
+        Attend to all tokens in masked rows from the expanded attention mask, for example the relevant first rows when
+        using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
+        Details: https://github.com/pytorch/pytorch/issues/110213
+        `expanded_mask` is [bsz, num_masks, tgt_seq_len, src_seq_len] or [bsz, tgt_seq_len, src_seq_len].
+        `attention_mask` is [bsz, src_seq_len].
+        The dimension num_masks of `expanded_mask` is most often 1, but it can also be the number of heads in the case of alibi attention bias.
+        For example, if `attention_mask` is
+        ```
+        [[0, 0, 1],
+         [1, 1, 1],
+         [0, 1, 1]]
+        ```
+        and `expanded_mask` is (e.g. here left-padding case)
+        ```
+        [[[[0, 0, 0],
+           [0, 0, 0],
+           [0, 0, 1]]],
+         [[[1, 0, 0],
+           [1, 1, 0],
+           [1, 1, 1]]],
+         [[[0, 0, 0],
+           [0, 1, 0],
+           [0, 1, 1]]]]
+        ```
+        then the modified `expanded_mask` will be
+        ```
+        [[[[1, 1, 1],   <-- modified
+           [1, 1, 1],   <-- modified
+           [0, 0, 1]]],
+         [[[1, 0, 0],
+           [1, 1, 0],
+           [1, 1, 1]]],
+         [[[1, 1, 1],   <-- modified
+           [0, 1, 0],
+           [0, 1, 1]]]]
+        ```
+        """
+        # fmt: on
+        # Get the index of the first non-zero value for every sample in the batch.
+        # In the above example, indices = [[2], [0], [1]]
+        tmp = torch.arange(attention_mask.shape[1], 0, -1)
+        indices = torch.argmax(attention_mask.cpu() * tmp, 1, keepdim=True)
+        # Find the batch indexes that have unattended tokens on the leftmost side (e.g. [0, 0, 1, 1, 1]), for which the first rows of the
+        # expanded mask will be completely unattended.
+        left_masked_rows = torch.where(indices > 0)[0]
+        if left_masked_rows.shape[0] == 0:
+            return expanded_mask
+        indices = indices[left_masked_rows]
+        max_len = torch.max(indices)
+        range_tensor = torch.arange(max_len).unsqueeze(0)
+        range_tensor = range_tensor.repeat(indices.size(0), 1)
+        # Avoid unmasking tokens at relevant target positions (on the row axis), by rather unmasking possibly several times the first row that should always be unmasked as we filtered out the batch above.
+        range_tensor[range_tensor >= indices] = 0
+        # TODO: we may drop support for 3D attention mask as the refactor from Patrick maybe dropped this case
+        if expanded_mask.dim() == 4:
+            num_masks = expanded_mask.shape[1]
+            if num_masks == 1:
+                # Broadcast [left_masked_rows, 1], [left_masked_rows, max_len]
+                mask_slice = (left_masked_rows[:, None], 0, range_tensor)
+            else:
+                # Broadcast [left_masked_rows, 1, 1], [1, num_masks, 1], [left_masked_rows, 1, max_len]
+                mask_slice = (
+                    left_masked_rows[:, None, None],
+                    torch.arange(num_masks)[None, :, None],
+                    range_tensor[:, None, :],
+                )
+        else:
+            # Broadcast [left_masked_rows, 1], [left_masked_rows, max_len]
+            mask_slice = (left_masked_rows[:, None], range_tensor)
+        expanded_mask[mask_slice] = unmasked_value
+        return expanded_mask
+def _prepare_4d_causal_attention_mask(
+    attention_mask: Optional[torch.Tensor],
+    input_shape: Union[torch.Size, Tuple, List],
+    inputs_embeds: torch.Tensor,
+    past_key_values_length: int,
+    sliding_window: Optional[int] = None,
+):
+    """
+    Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
+    `(batch_size, key_value_length)`
+    Args:
+        attention_mask (`torch.Tensor` or `None`):
+            A 2D attention mask of shape `(batch_size, key_value_length)`
+        input_shape (`tuple(int)` or `list(int)` or `torch.Size`):
+            The input shape should be a tuple that defines `(batch_size, query_length)`.
+        inputs_embeds (`torch.Tensor`):
+            The embedded inputs as a torch Tensor.
+        past_key_values_length (`int`):
+            The length of the key value cache.
+        sliding_window (`int`, *optional*):
+            If the model uses windowed attention, a sliding window should be passed.
+    """
+    attn_mask_converter = AttentionMaskConverter(
+        is_causal=True, sliding_window=sliding_window
+    )
+    key_value_length = input_shape[-1] + past_key_values_length
+    # 4d mask is passed through the layers
+    if attention_mask is not None:
+        attention_mask = attn_mask_converter.to_4d(
+            attention_mask,
+            input_shape[-1],
+            key_value_length=key_value_length,
+            dtype=inputs_embeds.dtype,
+        )
+    else:
+        attention_mask = attn_mask_converter.to_causal_4d(
+            input_shape[0],
+            input_shape[-1],
+            key_value_length,
+            dtype=inputs_embeds.dtype,
+            device=inputs_embeds.device,
+        )
+    return attention_mask
 @dataclass
 class MoeCausalLMOutputWithPast(ModelOutput):
     """
     Returns:
         The auxiliary loss.
     """
+    if gate_logits is None or (isinstance(gate_logits, Iterable) and len(gate_logits) == 0):
         return 0
     # ✨ Here is the fix for balance loss in Mixtral.
             )
         # 🔍
         self.softmax = nn.Softmax(dim=-1)
         self.top_k_attn = config.top_k_attn
+        self.attn_experts = config.attn_experts
         self.scale_factor_attn = config.scale_factor_attn
+        self.split_ratio = self.attn_experts // self.num_key_value_heads
+        self.gate = nn.Linear(self.hidden_size, self.attn_experts, bias=False)
         # 🔍
+        self.q_proj = nn.ModuleList([nn.Linear(self.hidden_size, self.num_key_value_groups * self.head_dim // self.split_ratio, bias=False) for _ in range(self.attn_experts)])
+        self.k_proj = nn.ModuleList([nn.Linear(self.hidden_size, self.head_dim, bias=False) for _ in range(self.attn_experts)])
+        self.v_proj = nn.ModuleList([nn.Linear(self.hidden_size, self.head_dim, bias=False) for _ in range(self.attn_experts)])
+        self.o_proj = nn.ModuleList([nn.Linear(self.num_key_value_groups * self.head_dim // self.split_ratio, self.hidden_size, bias=config.add_rescale_bias) for _ in range(self.attn_experts)])  # 🔍 (may add bias for rescaling)
         self.rotary_emb = MixtralRotaryEmbedding(
             self.head_dim,
             raise TypeError(
                 "`past_key_value` must be a `MoECache` instance for attention MoE!"
             )
+        # print("attention_mask", attention_mask, attention_mask.shape)
         device = hidden_states.device
         dtype = hidden_states.dtype
         bsz, q_len, hidden_dim = hidden_states.size()
         # One hot encode the selected experts to create an expert mask
         # this will be used to easily index which expert is going to be solicited
+        expert_mask = torch.nn.functional.one_hot(selected_experts, num_classes=self.attn_experts)  # (bsz * q_len, top_k_attn, attn_experts)
         expert_mask = expert_mask.permute(2, 1, 0)  # (attn_experts, top_k_attn, bsz * q_len)
         # Loop over all available experts in the model and perform the computation on each expert
         all_attn_weights = [] if output_attentions else None
+        for expert_idx in range(self.attn_experts):
             # expert_mask[expert_idx]: (top_k_attn, bsz * q_len)
             # idx: the topk position. (selected_num)
             # top_x: token index. (selected_num)
             key_states = self.k_proj[expert_idx](current_state)  # 🔍 specify expert
             value_states = self.v_proj[expert_idx](current_state)  # 🔍 specify expert
+            query_states = query_states.view(bsz, this_q_len, self.num_key_value_groups // self.split_ratio, self.head_dim).transpose(1, 2)  # 🔍 q_len -> this_q_len, num_heads -> num_key_value_groups
             key_states = key_states.view(bsz, this_q_len, 1, self.head_dim).transpose(1, 2)  # 🔍 q_len -> this_q_len, num_key_value_heads -> 1
             value_states = value_states.view(bsz, this_q_len, 1, self.head_dim).transpose(1, 2)  # 🔍 q_len -> this_q_len, num_key_value_heads -> 1
             attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)  # softmax temperature
+            if attn_weights.size() != (bsz, self.num_key_value_groups // self.split_ratio, this_q_len, kv_seq_len):  # 🔍 q_len -> this_q_len, num_heads -> num_key_value_groups
+                raise ValueError(f"Attention weights should be of size {(bsz, self.num_key_value_groups // self.split_ratio, this_q_len, kv_seq_len)}, but is {attn_weights.size()}")
             # 🔍 create `current_attention_mask` with reduced `seq_len`
             # Notice that the `attention_mask` is passed intact during both training & generation, so we need to adjust the `top_x` by `past_key_values_length`.
                     temp_attention_mask = attention_mask[:, previous_seen_tokens_total:].flatten()  # select along dimension 1 so that we get tokens in this iteration
                 else:
                     temp_attention_mask = attention_mask.flatten()  # flatten the dim
+                current_attention_mask[current_batch_ids, current_seq_ids] = temp_attention_mask[top_x].bool()  # assign masks sparsely
             else:
                 current_attention_mask[current_batch_ids, current_seq_ids] = True  # assign masks sparsely
+            # print("current_attention_mask", current_attention_mask, current_attention_mask.shape)
             if past_key_value is not None:  # 🔍 we need to update with cached attention mask
                 current_attention_mask = past_key_value.update_attention_mask(current_attention_mask, self.layer_idx, expert_idx)
                 raise ValueError(f"Attention mask should be of size {(bsz, 1, this_q_len, kv_seq_len)}, but is {current_attention_mask.size()}")
             attn_weights = attn_weights + current_attention_mask  # 🔍
+            # print("current_attention_mask", current_attention_mask.shape, current_attention_mask[0])
             # upcast attention to fp32
             attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
             attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
             attn_output = torch.matmul(attn_weights, value_states)
+            # if attn_output.size() != (bsz, self.num_key_value_groups // self.split_ratio, this_q_len, self.head_dim):  # 🔍 q_len -> this_q_len, num_heads -> num_key_value_groups
+                # raise ValueError(f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.size()}")
             attn_output = attn_output.transpose(1, 2).contiguous()
+            attn_output = attn_output.reshape(bsz, this_q_len, self.num_key_value_groups * self.head_dim // self.split_ratio)  # 🔍 q_len -> this_q_len, hidden_size -> num_key_value_groups * head_dim
             attn_output = self.o_proj[expert_idx](attn_output)
             # ---------------------------------------------- #
         # init
         attention_moe = MixtralAttentionMoE(config, layer_idx)
+        split = 1  # split the hidden_size, support split=1 --> 8/2, split=2 --> 16/4, split=4 --> 32/8
         # copy weights
+        num_key_value_groups = attention_moe.num_key_value_groups // split
         head_dim = attention_moe.head_dim
+        for i in range(config.num_key_value_heads * split):
             indices_q_o = [j for j in range(head_dim * num_key_value_groups * i, head_dim * num_key_value_groups * (i + 1))]
+            indices_k_v = [j for j in range(head_dim * (i // split), head_dim * ((i // split) + 1))]
+            # print(i, "indices_q_o", indices_q_o)
             # print(i, "indices_k_v", indices_k_v)
             attention_moe.q_proj[i].weight.data = attention.q_proj.weight.data[indices_q_o].clone()
         key_states = key_states.transpose(1, 2)
         value_states = value_states.transpose(1, 2)
+        # print("attention_mask", attention_mask, attention_mask.shape)
         attn_output = self._flash_attention_forward(
             query_states,
             key_states,
         self, query_layer, key_layer, value_layer, attention_mask, query_length
     ):
         batch_size, kv_seq_len, num_heads, head_dim = key_layer.shape
         # On the first iteration we need to properly re-create the padding mask
         # by slicing it on the proper place
         if kv_seq_len != attention_mask.shape[-1]:
         )
+class MixtralFlashAttention2MoE(MixtralFlashAttention2):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.top_k_attn = self.config.top_k_attn
+        self.attn_experts = self.config.attn_experts
+        self.scale_factor_attn = self.config.scale_factor_attn
+        self.split_ratio = self.attn_experts // self.num_key_value_heads
+        self.gate = nn.Linear(self.hidden_size, self.attn_experts, bias=False)
+        self.q_proj = nn.ModuleList([nn.Linear(self.hidden_size, self.num_key_value_groups * self.head_dim // self.split_ratio, bias=False) for _ in range(self.attn_experts)])
+        self.k_proj = nn.ModuleList([nn.Linear(self.hidden_size, self.head_dim, bias=False) for _ in range(self.attn_experts)])
+        self.v_proj = nn.ModuleList([nn.Linear(self.hidden_size, self.head_dim, bias=False) for _ in range(self.attn_experts)])
+        self.o_proj = nn.ModuleList([nn.Linear(self.num_key_value_groups * self.head_dim // self.split_ratio, self.hidden_size, bias=self.config.add_rescale_bias) for _ in range(self.attn_experts)])
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Cache] = None,
+        output_attentions: bool = False,
+        use_cache: bool = False,
+        **kwargs,
+    ):
+        if "padding_mask" in kwargs:
+            warnings.warn(
+                "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
+            )
+            # overwrite attention_mask with padding_mask
+            # attention_mask = kwargs.pop("padding_mask")
+        if past_key_value is not None and not isinstance(past_key_value, MoECache):  # 🔍 type check
+            raise TypeError(
+                "`past_key_value` must be a `MoECache` instance for attention MoE!"
+            )
+        bsz, q_len, hidden_dim = hidden_states.size()
+        device = hidden_states.device
+        dtype = hidden_states.dtype
+        hidden_states = hidden_states.reshape(-1, hidden_dim)
+        # gate compute
+        router_logits = self.gate(hidden_states)
+        router_scores = F.softmax(router_logits, dim=1, dtype=torch.float)
+        routing_weights, selected_experts = torch.topk(router_scores, self.top_k_attn, dim=-1)
+        routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
+        routing_weights = routing_weights.to(dtype)
+        final_attn_output = torch.zeros_like(hidden_states).reshape(-1, hidden_dim)
+        expert_mask = F.one_hot(selected_experts, num_classes=self.attn_experts).permute(2, 1, 0)  # (attn_experts, top_k_attn, bsz * q_len)
+        all_attn_weights = [] if output_attentions else None
+        for expert_idx in range(self.attn_experts):
+            idx, top_x = torch.nonzero(expert_mask[expert_idx], as_tuple=True)
+            # top_x_list = top_x.tolist()
+            # idx_list = idx.tolist()
+            if top_x.shape[0] == 0 and not self.training:  # skip empty experts only at inference; skipping during training desynchronizes the GPUs and blocks training
+                if output_attentions:
+                    all_attn_weights.append(None)
+                continue
+            # create position_ids for selected tokens
+            current_batch_ids = (top_x // q_len)
+            each_batch_selected_token_num = torch.bincount(current_batch_ids, minlength=bsz)  # (bsz)
+            this_q_len = each_batch_selected_token_num.max().item()
+            selection_mask = torch.zeros((bsz * q_len,), device=device, dtype=torch.bool)
+            selection_mask[top_x] = True
+            selection_mask = selection_mask.reshape(bsz, q_len)
+            token_position_indices = torch.cumsum(selection_mask, dim=1) - 1
+            token_position_indices = token_position_indices.flatten()
+            current_seq_ids = token_position_indices[top_x]
+            # 🔍 initialize hidden_states for this expert
+            current_state = torch.zeros((bsz, this_q_len, hidden_dim), dtype=dtype, device=device)
+            current_state[current_batch_ids, current_seq_ids] = hidden_states[top_x]  # assign tokens sparsely
+            # for attention forward
+            # expert_inputs = viewed_hidden_states[None, top_x_list].reshape(-1, self.hidden_size)
+            query_states = self.q_proj[expert_idx](current_state)
+            key_states = self.k_proj[expert_idx](current_state)
+            value_states = self.v_proj[expert_idx](current_state)
+            # seq_len = query_states.numel() // (bsz * self.num_key_value_groups * self.head_dim)
+            query_states = query_states.view(bsz, -1, self.num_key_value_groups // self.split_ratio, self.head_dim).transpose(1, 2)
+            key_states = key_states.view(bsz, -1, 1, self.head_dim).transpose(1, 2)
+            value_states = value_states.view(bsz, -1, 1, self.head_dim).transpose(1, 2)
+            # for moe kv cache
+            past_key_values_length = 0
+            kv_seq_len = key_states.shape[-2]
+            if past_key_value is not None:
+                if self.layer_idx is None:
+                    raise ValueError(
+                        f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
+                        "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
+                        "with a layer index."
+                    )
+                past_key_values_length = past_key_value.get_usable_length(kv_seq_len, self.layer_idx, expert_idx)  # 🔍 specify expert index
+                kv_seq_len += past_key_values_length
+            current_position_ids = torch.zeros((bsz, this_q_len), device=hidden_states.device, dtype=torch.long)
+            current_position_ids[current_batch_ids, current_seq_ids] = position_ids.expand(bsz, q_len).flatten()[top_x]
+            if top_x.shape[0] > 0:  # apply only when there are tokens
+                cos, sin = self.rotary_emb(value_states, seq_len=current_position_ids.max().item() + 1)  # 🔍 adjust the seq_len to the maximum possible value
+                query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, current_position_ids)
+            if past_key_value is not None:
+                cache_kwargs = {"sin": sin, "cos": cos}  # Specific to RoPE models
+                key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, expert_idx, cache_kwargs)  # 🔍 specify expert index
+            # print("attention_mask", attention_mask.shape, attention_mask)
+            # for current attention mask
+            '''
+            current_attention_mask = torch.zeros((bsz, this_q_len), dtype=torch.bool, device=device)
+            if attention_mask is not None:
+                if past_key_values_length > 0:  # 🔍 we need to exclude previous tokens
+                    previous_seen_tokens_total = past_key_value._seen_tokens_total - q_len
+                    temp_attention_mask = attention_mask[:, previous_seen_tokens_total:].flatten()  # select along dimension 1 so that we get tokens in this iteration
+                else:
+                    temp_attention_mask = attention_mask.flatten()  # flatten the dim
+                current_attention_mask[current_batch_ids, current_seq_ids] = temp_attention_mask[top_x]  # bug here !!!
+            else:
+                current_attention_mask[current_batch_ids, current_seq_ids] = True  # assign masks sparsely
+            if past_key_value is not None:  # 🔍 we need to update with cached attention mask
+                current_attention_mask = past_key_value.update_attention_mask(current_attention_mask, self.layer_idx, expert_idx)
+            current_attention_mask = _prepare_4d_causal_attention_mask(
+                current_attention_mask,
+                (bsz, this_q_len),
+                current_state,
+                past_key_values_length,
+                sliding_window=self.config.sliding_window,
+            )
+            if current_attention_mask.size() != (bsz, 1, this_q_len, kv_seq_len):  # 🔍 q_len -> this_q_len
+                raise ValueError(f"Attention mask should be of size {(bsz, 1, this_q_len, kv_seq_len)}, but is {current_attention_mask.size()}")
+            '''
+            # for sliding window
+            use_sliding_windows = (
+                _flash_supports_window_size
+                and getattr(self.config, "sliding_window", None) is not None
+                and kv_seq_len > self.config.sliding_window
+            )
+            if not _flash_supports_window_size:
+                logger.warning_once(
+                    "The current flash attention version does not support sliding window attention, for a more memory efficient implementation"
+                    " make sure to upgrade flash-attn library."
+                )
+            # wait for change! sliding_window=4096
+            if past_key_value is not None:
+                # Activate slicing cache only if the config sets a `sliding_window` value
+                cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
+                if (
+                    getattr(self.config, "sliding_window", None) is not None
+                    and kv_seq_len > self.config.sliding_window
+                    and cache_has_contents
+                ):
+                    slicing_tokens = 1 - self.config.sliding_window
+                    past_key = past_key_value[self.layer_idx][0]
+                    past_value = past_key_value[self.layer_idx][1]
+                    past_key = past_key[:, :, slicing_tokens:, :].contiguous()
+                    past_value = past_value[:, :, slicing_tokens:, :].contiguous()
+                    if past_key.shape[-2] != self.config.sliding_window - 1:
+                        raise ValueError(
+                            f"past key must have a shape of (`batch_size, num_heads, self.config.sliding_window-1, head_dim`), got"
+                            f" {past_key.shape}"
+                        )
+                    if attention_mask is not None:
+                        attention_mask = attention_mask[:, slicing_tokens:]
+                        attention_mask = torch.cat(
+                            [attention_mask, torch.ones_like(attention_mask[:, -1:])],
+                            dim=-1,
+                        )
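+                # NOTE: the cache was already updated with `expert_idx` above; this second call omits `expert_idx`.
+                cache_kwargs = {"sin": sin, "cos": cos}  # Specific to RoPE models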
+                key_states, value_states = past_key_value.update(
+                    key_states, value_states, self.layer_idx, cache_kwargs
+                )
+            # for input dtype
+            input_dtype = query_states.dtype
+            if input_dtype == torch.float32:
+                # Handle the case where the model is quantized
+                if hasattr(self.config, "_pre_quantization_dtype"):
+                    target_dtype = self.config._pre_quantization_dtype
+                else:
+                    target_dtype = self.q_proj[0].weight.dtype
+                logger.warning_once(
+                    f"The input hidden states seem to have been silently cast to float32; this might be related to"
+                    f" the fact that you have upcast embedding or layer norm layers to float32. We will cast the input back to"
+                    f" {target_dtype}."
+                )
+                query_states = query_states.to(target_dtype)
+                key_states = key_states.to(target_dtype)
+                value_states = value_states.to(target_dtype)
+            dropout_rate = 0.0 if not self.training else self.attention_dropout
+            repeat_num = query_states.shape[1]
+            key_states = repeat_kv(key_states, repeat_num)
+            value_states = repeat_kv(value_states, repeat_num)
+            # print("repeat_num", repeat_num)
+            # print("query_states shape", query_states.shape, key_states.shape, value_states.shape)
+            # Reshape to the expected shape for Flash Attention
+            query_states = query_states.transpose(1, 2)
+            key_states = key_states.transpose(1, 2)
+            value_states = value_states.transpose(1, 2)
+            attn_output = self._flash_attention_forward(
+                query_states,
+                key_states,
+                value_states,
+                attention_mask,
+                this_q_len,
+                dropout=dropout_rate,
+                use_sliding_windows=use_sliding_windows,
+            )
+            attn_output = attn_output.reshape(bsz, this_q_len, self.num_key_value_groups * self.head_dim // self.split_ratio).contiguous()
+            attn_output = self.o_proj[expert_idx](attn_output)
+            attn_output = attn_output[current_batch_ids, current_seq_ids] * (routing_weights[top_x, idx, None] * self.scale_factor_attn)
+            final_attn_output.index_add_(0, top_x, attn_output)
+        final_attn_output = final_attn_output.reshape(bsz, q_len, hidden_dim)
+        # Flash attention does not materialize attention weights, so there is nothing to return here.
+        attn_weights = None
+        return final_attn_output, attn_weights, past_key_value, router_logits  # 🔍 return an extra `router_logits`
+class MixtralFlashAttention2MoE_zt(MixtralFlashAttention2):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.top_k_attn = self.config.top_k_attn
+        self.scale_factor_attn = self.config.scale_factor_attn
+        # self.num_heads
+        # self.head_dim
+        # self.num_key_value_heads
+        # self.num_key_value_groups  # total number of experts
+        assert self.top_k_attn <= self.num_key_value_groups
+        # assert self.top_k_attn % self.num_key_value_heads == 0
+        self.attn_hsz = self.hidden_size // self.num_key_value_groups * self.top_k_attn
+        self.kv_repeat_num = self.attn_hsz // (self.num_key_value_heads * self.head_dim)
+        self.simulated_attn_head_num = self.attn_hsz // self.head_dim
+        assert self.attn_hsz % (self.num_key_value_heads * self.head_dim) == 0
+        assert self.simulated_attn_head_num == self.num_heads * (self.top_k_attn / self.num_key_value_groups)
+        assert self.kv_repeat_num * self.num_key_value_heads == self.simulated_attn_head_num
+        self.gate = nn.Linear(self.hidden_size, self.num_key_value_groups, bias=False)
+        # tzhu: there are self.num_key_value_groups experts
+        #       each expert has a size of self.attn_hsz
+        self.q_proj = nn.ModuleList(
+            [nn.Linear(self.hidden_size, self.attn_hsz) for _ in range(self.num_key_value_groups)]
+        )
+        self.o_proj = nn.ModuleList(
+            [nn.Linear(self.attn_hsz, self.hidden_size) for _ in range(self.num_key_value_groups)]
+        )
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Cache] = None,
+        output_attentions: bool = False,
+        use_cache: bool = False,
+        **kwargs,
+    ):
+        if "padding_mask" in kwargs:
+            warnings.warn(
+                "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
+            )
+            # overwrite attention_mask with padding_mask
+            attention_mask = kwargs.pop("padding_mask")
+        bsz, q_len, _ = hidden_states.size()
+        key_states = self.k_proj(hidden_states)
+        value_states = self.v_proj(hidden_states)
+        # tzhu: attn-moe on q_proj
+        viewed_hidden_states = hidden_states.view(bsz * q_len, self.hidden_size)
+        # router
+        router_logits = self.gate(viewed_hidden_states)
+        router_scores = F.softmax(router_logits, dim=-1, dtype=torch.float)
+        routing_weights, selected_experts = torch.topk(router_scores, self.top_k_attn, dim=-1)
+        routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
+        routing_weights = routing_weights.to(hidden_states.dtype)
+        query_states = torch.zeros(
+            (bsz * q_len, self.attn_hsz),
+            dtype=hidden_states.dtype,
+            device=hidden_states.device,
+        )
+        # expert_mask: (num_experts, top_k_attn, bsz * q_len)
+        expert_mask = F.one_hot(selected_experts, num_classes=self.num_key_value_groups).permute(2, 1, 0)
+        for expert_idx in range(self.num_key_value_groups):
+            expert_layer = self.q_proj[expert_idx]
+            idx, top_x = torch.where(expert_mask[expert_idx])
+            top_x_list = top_x.tolist()
+            idx_list = idx.tolist()
+            expert_inputs = viewed_hidden_states[None, top_x_list].reshape(-1, self.hidden_size)
+            # inputs (-1, hidden_size) -> outputs (-1, attn_hsz)
+            expert_outs = expert_layer(expert_inputs) * routing_weights[top_x_list, idx_list, None] * self.scale_factor_attn
+            query_states.index_add_(0, top_x, expert_outs.to(query_states.dtype))
+        query_states = query_states.view(bsz, q_len, self.attn_hsz)
+        # query_states = query_states.view(
+        #     bsz, q_len, self.num_heads, self.simulated_attn_head_num
+        # ).transpose(1, 2)
+        query_states = query_states.view(
+            bsz, q_len, self.simulated_attn_head_num, self.head_dim
+        ).transpose(1, 2)
+        key_states = key_states.view(
+            bsz, q_len, self.num_key_value_heads, self.head_dim
+        ).transpose(1, 2)
+        value_states = value_states.view(
+            bsz, q_len, self.num_key_value_heads, self.head_dim
+        ).transpose(1, 2)
+        kv_seq_len = key_states.shape[-2]
+        if past_key_value is not None:
+            if self.layer_idx is None:
+                raise ValueError(
+                    f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
+                    "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
+                    "with a layer index."
+                )
+            kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
+        # Because the input can be padded, the absolute sequence length depends on the max position id.
+        rotary_seq_len = max(kv_seq_len, position_ids[:, -1].max().item()) + 1
+        cos, sin = self.rotary_emb(value_states, seq_len=rotary_seq_len)
+        query_states, key_states = apply_rotary_pos_emb(
+            query_states, key_states, cos, sin, position_ids
+        )
+        use_sliding_windows = (
+            _flash_supports_window_size
+            and getattr(self.config, "sliding_window", None) is not None
+            and kv_seq_len > self.config.sliding_window
+        )
+        if not _flash_supports_window_size:
+            logger.warning_once(
+                "The current flash attention version does not support sliding window attention, for a more memory efficient implementation"
+                " make sure to upgrade flash-attn library."
+            )
+        if past_key_value is not None:
+            # Activate slicing cache only if the config sets a `sliding_window` value
+            cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
+            if (
+                getattr(self.config, "sliding_window", None) is not None
+                and kv_seq_len > self.config.sliding_window
+                and cache_has_contents
+            ):
+                slicing_tokens = 1 - self.config.sliding_window
+                past_key = past_key_value[self.layer_idx][0]
+                past_value = past_key_value[self.layer_idx][1]
+                past_key = past_key[:, :, slicing_tokens:, :].contiguous()
+                past_value = past_value[:, :, slicing_tokens:, :].contiguous()
+                if past_key.shape[-2] != self.config.sliding_window - 1:
+                    raise ValueError(
+                        f"past key must have a shape of (`batch_size, num_heads, self.config.sliding_window-1, head_dim`), got"
+                        f" {past_key.shape}"
+                    )
+                if attention_mask is not None:
+                    attention_mask = attention_mask[:, slicing_tokens:]
+                    attention_mask = torch.cat(
+                        [attention_mask, torch.ones_like(attention_mask[:, -1:])],
+                        dim=-1,
+                    )
+            cache_kwargs = {"sin": sin, "cos": cos}  # Specific to RoPE models
+            key_states, value_states = past_key_value.update(
+                key_states, value_states, self.layer_idx, cache_kwargs
+            )
+        # repeat k/v heads if n_kv_heads < n_heads
+        key_states = repeat_kv(key_states, self.kv_repeat_num)
+        value_states = repeat_kv(value_states, self.kv_repeat_num)
+        dropout_rate = 0.0 if not self.training else self.attention_dropout
+        # In PEFT, layer norms are usually cast to float32 for training stability,
+        # so the input hidden states are silently upcast to float32. Cast them back
+        # to the target dtype below so that Flash Attention works as expected.
+        input_dtype = query_states.dtype
+        if input_dtype == torch.float32:
+            # Handle the case where the model is quantized
+            if hasattr(self.config, "_pre_quantization_dtype"):
+                target_dtype = self.config._pre_quantization_dtype
+            else:
+                target_dtype = self.q_proj.weight.dtype
+            logger.warning_once(
+                f"The input hidden states seem to have been silently cast to float32; this might be because"
+                f" embedding or layer norm layers were upcast to float32. The input will be cast back to"
+                f" {target_dtype}."
+            )
+            query_states = query_states.to(target_dtype)
+            key_states = key_states.to(target_dtype)
+            value_states = value_states.to(target_dtype)
+        # Reshape to the layout expected by Flash Attention
+        query_states = query_states.transpose(1, 2)
+        key_states = key_states.transpose(1, 2)
+        value_states = value_states.transpose(1, 2)
+        attn_output = self._flash_attention_forward(
+            query_states,
+            key_states,
+            value_states,
+            attention_mask,
+            q_len,
+            dropout=dropout_rate,
+            use_sliding_windows=use_sliding_windows,
+        )
+        attn_output = attn_output.reshape(bsz * q_len, self.attn_hsz).contiguous()
+        final_attn_output = torch.zeros(
+            (bsz * q_len, self.hidden_size),
+            dtype=hidden_states.dtype,
+            device=hidden_states.device,
+        )
+        for expert_idx in range(self.num_key_value_groups):
+            expert_layer = self.o_proj[expert_idx]
+            idx, top_x = torch.where(expert_mask[expert_idx])
+            top_x_list = top_x.tolist()
+            idx_list = idx.tolist()
+            expert_inputs = attn_output[None, top_x_list].reshape(-1, self.attn_hsz)
+            expert_outs = expert_layer(expert_inputs) * routing_weights[top_x_list, idx_list, None] * self.scale_factor_attn
+            final_attn_output.index_add_(0, top_x, expert_outs.to(final_attn_output.dtype))
+        final_attn_output = final_attn_output.view(bsz, q_len, self.hidden_size)
+        if not output_attentions:
+            attn_weights = None
+        return final_attn_output, attn_weights, past_key_value, router_logits
+    @torch.no_grad()
+    def from_vanilla_attention(attention: MixtralAttention, top_k_attn, scale_factor_attn):
+        # config
+        layer_idx = attention.layer_idx
+        config = attention.config
+        config.top_k_attn = top_k_attn
+        config.scale_factor_attn = scale_factor_attn
+        # init
+        attention_moe = MixtralFlashAttention2MoE(config, layer_idx)
+        # copy weights
+        num_key_value_groups = attention_moe.num_key_value_groups
+        head_dim = attention_moe.head_dim
+        for i in range(num_key_value_groups):
+            indices_q_o = []
+            for j in range(attention_moe.num_key_value_heads):
+                k = i + j * num_key_value_groups
+                indices_q_o.extend(
+                    list(range(k * head_dim, (k + 1) * head_dim))
+                )
+            print(i, "indices_q_o", indices_q_o)
+            attention_moe.q_proj[i].weight.data = attention.q_proj.weight.data[indices_q_o].clone()
+            attention_moe.o_proj[i].weight.data = attention.o_proj.weight.data[:, indices_q_o].clone()
+        return attention_moe
 class MixtralBLockSparseTop2MLP(nn.Module):
     def __init__(self, config: MixtralConfig, ffn_dim, add_rescale_bias=False):  # 🔍
         super().__init__()
 # 🔍
 MISTRAL_ATTENTION_MOE_CLASSES = {
     "eager": MixtralAttentionMoE,
+    "flash_attention_2": MixtralFlashAttention2MoE,
 }
         )
         self.use_attn_moe = config.use_attn_moe
+        if self.use_attn_moe:
+            attn_class = MISTRAL_ATTENTION_MOE_CLASSES[config._attn_implementation]
+        else:
+            attn_class = MISTRAL_ATTENTION_CLASSES[config._attn_implementation]
+        self.self_attn = attn_class(config, layer_idx)
         if self.is_moe:
             self.block_sparse_moe = MixtralSparseMoeBlock(config)
             self.mlp_residual = (
                 MixtralBLockSparseTop2MLP(config, config.intermediate_size_residual)
             )
         else:
             self.block_sparse_moe = MixtralBLockSparseTop2MLP(
                 config, config.intermediate_size * config.num_local_experts
             )
         hidden_states = self.input_layernorm(hidden_states)
         # 🔍 Self Attention
+        if self.use_attn_moe:
             (
                 hidden_states,
                 self_attn_weights,
         # Fully Connected
         residual = hidden_states
+        hidden_states_input = self.post_attention_layernorm(hidden_states)
         # 🔍
         if self.is_moe:
+            hidden_states, router_logits = self.block_sparse_moe(hidden_states_input)
         else:
+            hidden_states = self.block_sparse_moe(hidden_states_input)
             router_logits = None
         if self.mlp_residual is not None:
+            hidden_states += self.mlp_residual(hidden_states_input)  #
         hidden_states = residual + hidden_states
         outputs = (hidden_states,)
             if len(valid_attn_router_logits) > 0:  # at least one layer returned router logits that are not None
                 attn_aux_loss = load_balancing_loss_func(
                     valid_attn_router_logits,
+                    self.config.attn_experts,
                     self.config.top_k_attn,
                     use_layer_wise_balance=self.config.use_layer_wise_balance,  # ✨
                 )
             if past is None:
                 if self.config.use_attn_moe:  # 🔍
                     model_kwargs["past_key_values"] = MoECache(
+                        # self.config.num_key_value_heads
+                        self.config.attn_experts
                     )
                 else:  # 🔍
                     model_kwargs["past_key_values"] = DynamicCache()

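The head-to-expert partitioning performed by `from_vanilla_attention` in the diff above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the helper name `expert_head_indices` is hypothetical, and it only reproduces the index arithmetic `k = i + j * num_key_value_groups` used to pick the rows of `q_proj` and the columns of `o_proj` for each attention expert.

```python
def expert_head_indices(expert_idx, num_key_value_heads, num_key_value_groups, head_dim):
    """Return the q_proj row indices (and o_proj column indices) for one expert.

    Hypothetical helper mirroring the loop in `from_vanilla_attention`:
    expert `i` takes query head `i + j * num_key_value_groups` for every
    key/value head `j`, and each head spans `head_dim` consecutive indices.
    """
    indices = []
    for kv_head in range(num_key_value_heads):
        head = expert_idx + kv_head * num_key_value_groups
        indices.extend(range(head * head_dim, (head + 1) * head_dim))
    return indices


# Toy configuration (assumed values, chosen only to keep the output small):
# 2 key/value heads, 2 query heads per key/value head, head_dim of 2.
parts = [expert_head_indices(i, 2, 2, 2) for i in range(2)]
# parts == [[0, 1, 4, 5], [2, 3, 6, 7]]
```

Under these assumptions each expert receives one query head from every key/value group, and the experts together cover every index exactly once, which is what allows the dense `q_proj` and `o_proj` weights to be split without loss.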
trainer_state.json CHANGED

@@ -1,1278 +1,2398 @@
 {
   "best_metric": null,
   "best_model_checkpoint": null,
-  "epoch": 1.8575851393188856,
   "eval_steps": 500,
-  "global_step": 1800,
   "is_hyper_param_search": false,
   "is_local_process_zero": true,
   "is_world_process_zero": true,
   "log_history": [
     {
-      "epoch": 0.010319917440660475,
-      "grad_norm": 2.8247148990631104,
-      "learning_rate": 2.2727272727272728e-06,
-      "loss": 0.8288,
-      "step": 10
     },
     {
-      "epoch": 0.02063983488132095,
-      "grad_norm": 1.1619349718093872,
-      "learning_rate": 4.5454545454545455e-06,
-      "loss": 0.7902,
-      "step": 20
     },
     {
-      "epoch": 0.030959752321981424,
-      "grad_norm": 0.7691543698310852,
-      "learning_rate": 6.818181818181818e-06,
-      "loss": 0.7388,
-      "step": 30
     },
     {
-      "epoch": 0.0412796697626419,
-      "grad_norm": 0.687256395816803,
-      "learning_rate": 9.090909090909091e-06,
-      "loss": 0.7177,
-      "step": 40
     },
     {
-      "epoch": 0.05159958720330237,
-      "grad_norm": 0.6163066029548645,
-      "learning_rate": 1.1363636363636366e-05,
-      "loss": 0.701,
-      "step": 50
     },
     {
-      "epoch": 0.06191950464396285,
-      "grad_norm": 0.6468276381492615,
-      "learning_rate": 1.3636363636363637e-05,
-      "loss": 0.6853,
-      "step": 60
     },
     {
-      "epoch": 0.07223942208462332,
-      "grad_norm": 0.9129849672317505,
-      "learning_rate": 1.590909090909091e-05,
-      "loss": 0.6749,
-      "step": 70
     },
     {
-      "epoch": 0.0825593395252838,
-      "grad_norm": 0.9610547423362732,
-      "learning_rate": 1.8181818181818182e-05,
-      "loss": 0.664,
-      "step": 80
     },
     {
-      "epoch": 0.09287925696594428,
-      "grad_norm": 0.9436660408973694,
-      "learning_rate": 1.9999975160696756e-05,
-      "loss": 0.6637,
-      "step": 90
     },
     {
-      "epoch": 0.10319917440660474,
-      "grad_norm": 0.828860878944397,
-      "learning_rate": 1.999910579803988e-05,
-      "loss": 0.6578,
-      "step": 100
     },
     {
-      "epoch": 0.11351909184726522,
-      "grad_norm": 0.8615094423294067,
-      "learning_rate": 1.9996994593616145e-05,
-      "loss": 0.6473,
-      "step": 110
     },
     {
-      "epoch": 0.1238390092879257,
-      "grad_norm": 0.8153389096260071,
-      "learning_rate": 1.9993641809627166e-05,
-      "loss": 0.6402,
-      "step": 120
     },
     {
-      "epoch": 0.13415892672858618,
-      "grad_norm": 0.8015602827072144,
-      "learning_rate": 1.9989047862472904e-05,
-      "loss": 0.6378,
-      "step": 130
     },
     {
-      "epoch": 0.14447884416924664,
-      "grad_norm": 0.7367793321609497,
-      "learning_rate": 1.9983213322699926e-05,
-      "loss": 0.6346,
-      "step": 140
     },
     {
-      "epoch": 0.15479876160990713,
-      "grad_norm": 0.913837194442749,
-      "learning_rate": 1.997613891493054e-05,
-      "loss": 0.6322,
-      "step": 150
     },
     {
-      "epoch": 0.1651186790505676,
-      "grad_norm": 0.7812018990516663,
-      "learning_rate": 1.996782551777282e-05,
-      "loss": 0.6206,
-      "step": 160
     },
     {
-      "epoch": 0.17543859649122806,
-      "grad_norm": 0.7282320857048035,
-      "learning_rate": 1.995827416371147e-05,
-      "loss": 0.6127,
-      "step": 170
     },
     {
-      "epoch": 0.18575851393188855,
-      "grad_norm": 0.7379522919654846,
-      "learning_rate": 1.9947486038979606e-05,
-      "loss": 0.6098,
-      "step": 180
     },
     {
-      "epoch": 0.19607843137254902,
-      "grad_norm": 0.7425850033760071,
-      "learning_rate": 1.993546248341142e-05,
-      "loss": 0.6079,
-      "step": 190
     },
     {
-      "epoch": 0.20639834881320948,
-      "grad_norm": 0.6972795724868774,
-      "learning_rate": 1.9922204990275788e-05,
-      "loss": 0.6006,
-      "step": 200
     },
     {
-      "epoch": 0.21671826625386997,
-      "grad_norm": 0.7257381677627563,
-      "learning_rate": 1.9907715206090817e-05,
-      "loss": 0.6042,
-      "step": 210
     },
     {
-      "epoch": 0.22703818369453044,
-      "grad_norm": 0.6420859098434448,
-      "learning_rate": 1.989199493041935e-05,
-      "loss": 0.593,
-      "step": 220
     },
     {
-      "epoch": 0.23735810113519093,
-      "grad_norm": 0.7002107501029968,
-      "learning_rate": 1.9875046115645443e-05,
-      "loss": 0.5931,
-      "step": 230
     },
     {
-      "epoch": 0.2476780185758514,
-      "grad_norm": 0.7185678482055664,
-      "learning_rate": 1.9856870866731946e-05,
-      "loss": 0.5926,
-      "step": 240
     },
     {
-      "epoch": 0.2579979360165119,
-      "grad_norm": 0.6459465026855469,
-      "learning_rate": 1.983747144095902e-05,
-      "loss": 0.5878,
-      "step": 250
     },
     {
-      "epoch": 0.26831785345717235,
-      "grad_norm": 0.6379982233047485,
-      "learning_rate": 1.9816850247643834e-05,
-      "loss": 0.5796,
-      "step": 260
     },
     {
-      "epoch": 0.2786377708978328,
-      "grad_norm": 0.7094199061393738,
-      "learning_rate": 1.97950098478413e-05,
-      "loss": 0.5771,
-      "step": 270
     },
     {
-      "epoch": 0.2889576883384933,
-      "grad_norm": 0.6646308302879333,
-      "learning_rate": 1.9771952954026038e-05,
-      "loss": 0.5767,
-      "step": 280
     },
     {
-      "epoch": 0.29927760577915374,
-      "grad_norm": 0.6392974257469177,
-      "learning_rate": 1.9747682429755493e-05,
-      "loss": 0.5737,
-      "step": 290
     },
     {
-      "epoch": 0.30959752321981426,
-      "grad_norm": 0.5905966758728027,
-      "learning_rate": 1.972220128931427e-05,
-      "loss": 0.576,
-      "step": 300
     },
     {
-      "epoch": 0.31991744066047473,
-      "grad_norm": 0.8001016974449158,
-      "learning_rate": 1.9695512697339797e-05,
-      "loss": 0.5698,
-      "step": 310
     },
     {
-      "epoch": 0.3302373581011352,
-      "grad_norm": 0.5997283458709717,
-      "learning_rate": 1.966761996842929e-05,
-      "loss": 0.5703,
-      "step": 320
     },
     {
-      "epoch": 0.34055727554179566,
-      "grad_norm": 0.6440294981002808,
-      "learning_rate": 1.9638526566728088e-05,
-      "loss": 0.5584,
-      "step": 330
     },
     {
-      "epoch": 0.3508771929824561,
-      "grad_norm": 0.7667876482009888,
-      "learning_rate": 1.960823610549943e-05,
-      "loss": 0.5585,
-      "step": 340
     },
     {
-      "epoch": 0.36119711042311664,
-      "grad_norm": 0.6358545422554016,
-      "learning_rate": 1.9576752346675692e-05,
-      "loss": 0.5578,
-      "step": 350
     },
     {
-      "epoch": 0.3715170278637771,
-      "grad_norm": 0.6375100612640381,
-      "learning_rate": 1.954407920039119e-05,
-      "loss": 0.5621,
-      "step": 360
     },
     {
-      "epoch": 0.38183694530443757,
-      "grad_norm": 0.7324113845825195,
-      "learning_rate": 1.951022072449655e-05,
-      "loss": 0.5527,
-      "step": 370
     },
     {
-      "epoch": 0.39215686274509803,
-      "grad_norm": 0.658400297164917,
-      "learning_rate": 1.9475181124054742e-05,
-      "loss": 0.5538,
-      "step": 380
     },
     {
-      "epoch": 0.4024767801857585,
-      "grad_norm": 0.7300146222114563,
-      "learning_rate": 1.9438964750818833e-05,
-      "loss": 0.5494,
-      "step": 390
     },
     {
-      "epoch": 0.41279669762641896,
-      "grad_norm": 0.7315788865089417,
-      "learning_rate": 1.940157610269152e-05,
-      "loss": 0.5493,
-      "step": 400
     },
     {
-      "epoch": 0.4231166150670795,
-      "grad_norm": 0.6689688563346863,
-      "learning_rate": 1.9363019823166506e-05,
-      "loss": 0.5509,
-      "step": 410
     },
     {
-      "epoch": 0.43343653250773995,
-      "grad_norm": 0.6882718205451965,
-      "learning_rate": 1.9323300700751816e-05,
-      "loss": 0.5473,
-      "step": 420
     },
     {
-      "epoch": 0.4437564499484004,
-      "grad_norm": 0.6466957330703735,
-      "learning_rate": 1.9282423668375064e-05,
-      "loss": 0.5435,
-      "step": 430
     },
     {
-      "epoch": 0.4540763673890609,
-      "grad_norm": 0.6492331624031067,
-      "learning_rate": 1.9240393802770824e-05,
-      "loss": 0.5449,
-      "step": 440
     },
     {
-      "epoch": 0.46439628482972134,
-      "grad_norm": 0.5815872550010681,
-      "learning_rate": 1.9197216323850122e-05,
-      "loss": 0.5398,
-      "step": 450
     },
     {
-      "epoch": 0.47471620227038186,
-      "grad_norm": 0.6003971099853516,
-      "learning_rate": 1.9152896594052134e-05,
-      "loss": 0.533,
-      "step": 460
     },
     {
-      "epoch": 0.4850361197110423,
-      "grad_norm": 0.5987655520439148,
-      "learning_rate": 1.910744011767821e-05,
-      "loss": 0.5309,
-      "step": 470
     },
     {
-      "epoch": 0.4953560371517028,
-      "grad_norm": 0.6432524919509888,
-      "learning_rate": 1.9060852540208277e-05,
-      "loss": 0.5344,
-      "step": 480
     },
     {
-      "epoch": 0.5056759545923633,
-      "grad_norm": 0.5650415420532227,
-      "learning_rate": 1.9013139647599656e-05,
-      "loss": 0.5333,
-      "step": 490
     },
     {
-      "epoch": 0.5159958720330238,
-      "grad_norm": 0.6225659847259521,
-      "learning_rate": 1.8964307365568513e-05,
-      "loss": 0.5231,
-      "step": 500
     },
     {
-      "epoch": 0.5263157894736842,
-      "grad_norm": 0.6020525097846985,
-      "learning_rate": 1.89143617588539e-05,
-      "loss": 0.5241,
-      "step": 510
     },
     {
-      "epoch": 0.5366357069143447,
-      "grad_norm": 0.5726006031036377,
-      "learning_rate": 1.886330903046454e-05,
-      "loss": 0.5278,
-      "step": 520
     },
     {
-      "epoch": 0.5469556243550051,
-      "grad_norm": 0.5783742666244507,
-      "learning_rate": 1.8811155520908445e-05,
-      "loss": 0.5253,
-      "step": 530
     },
     {
-      "epoch": 0.5572755417956656,
-      "grad_norm": 0.5478541254997253,
-      "learning_rate": 1.8757907707405456e-05,
-      "loss": 0.5166,
-      "step": 540
     },
     {
-      "epoch": 0.5675954592363261,
-      "grad_norm": 0.5668419003486633,
-      "learning_rate": 1.8703572203082795e-05,
-      "loss": 0.5206,
-      "step": 550
     },
     {
-      "epoch": 0.5779153766769866,
-      "grad_norm": 0.5729948282241821,
-      "learning_rate": 1.8648155756153768e-05,
-      "loss": 0.516,
-      "step": 560
     },
     {
-      "epoch": 0.5882352941176471,
-      "grad_norm": 0.651300311088562,
-      "learning_rate": 1.859166524907963e-05,
-      "loss": 0.5183,
-      "step": 570
     },
     {
-      "epoch": 0.5985552115583075,
-      "grad_norm": 0.6236013174057007,
-      "learning_rate": 1.8534107697714864e-05,
-      "loss": 0.5242,
-      "step": 580
     },
     {
-      "epoch": 0.608875128998968,
-      "grad_norm": 0.5427743196487427,
-      "learning_rate": 1.84754902504358e-05,
-      "loss": 0.5291,
-      "step": 590
     },
     {
-      "epoch": 0.6191950464396285,
-      "grad_norm": 0.5849993824958801,
-      "learning_rate": 1.8415820187252847e-05,
-      "loss": 0.5213,
-      "step": 600
     },
     {
-      "epoch": 0.6295149638802889,
-      "grad_norm": 0.6405364274978638,
-      "learning_rate": 1.8355104918906353e-05,
-      "loss": 0.5187,
-      "step": 610
     },
     {
-      "epoch": 0.6398348813209495,
-      "grad_norm": 0.5616128444671631,
-      "learning_rate": 1.8293351985946194e-05,
-      "loss": 0.5108,
-      "step": 620
     },
     {
-      "epoch": 0.6501547987616099,
-      "grad_norm": 0.5770090222358704,
-      "learning_rate": 1.823056905779532e-05,
-      "loss": 0.5172,
-      "step": 630
     },
     {
-      "epoch": 0.6604747162022704,
-      "grad_norm": 0.5251275300979614,
-      "learning_rate": 1.816676393179721e-05,
-      "loss": 0.5116,
-      "step": 640
     },
     {
-      "epoch": 0.6707946336429309,
-      "grad_norm": 0.5879736542701721,
-      "learning_rate": 1.8101944532247495e-05,
-      "loss": 0.5157,
-      "step": 650
     },
     {
-      "epoch": 0.6811145510835913,
-      "grad_norm": 0.5661890506744385,
-      "learning_rate": 1.80361189094098e-05,
-      "loss": 0.5088,
-      "step": 660
     },
     {
-      "epoch": 0.6914344685242518,
-      "grad_norm": 0.5618740916252136,
-      "learning_rate": 1.796929523851593e-05,
-      "loss": 0.5111,
-      "step": 670
     },
     {
-      "epoch": 0.7017543859649122,
-      "grad_norm": 0.5378845930099487,
-      "learning_rate": 1.790148181875055e-05,
-      "loss": 0.5118,
-      "step": 680
     },
     {
-      "epoch": 0.7120743034055728,
-      "grad_norm": 0.5547090172767639,
-      "learning_rate": 1.783268707222048e-05,
-      "loss": 0.5088,
-      "step": 690
     },
     {
-      "epoch": 0.7223942208462333,
-      "grad_norm": 0.5933310389518738,
-      "learning_rate": 1.776291954290867e-05,
-      "loss": 0.5063,
-      "step": 700
     },
     {
-      "epoch": 0.7327141382868937,
-      "grad_norm": 0.5393312573432922,
-      "learning_rate": 1.769218789561312e-05,
-      "loss": 0.5014,
-      "step": 710
     },
     {
-      "epoch": 0.7430340557275542,
-      "grad_norm": 0.5515422821044922,
-      "learning_rate": 1.7620500914870734e-05,
-      "loss": 0.5116,
-      "step": 720
     },
     {
-      "epoch": 0.7533539731682146,
-      "grad_norm": 0.5601432919502258,
-      "learning_rate": 1.7547867503866315e-05,
-      "loss": 0.5024,
-      "step": 730
     },
     {
-      "epoch": 0.7636738906088751,
-      "grad_norm": 0.5876237154006958,
-      "learning_rate": 1.7474296683326844e-05,
-      "loss": 0.5098,
-      "step": 740
     },
     {
-      "epoch": 0.7739938080495357,
-      "grad_norm": 0.518947184085846,
-      "learning_rate": 1.739979759040114e-05,
-      "loss": 0.5017,
-      "step": 750
     },
     {
-      "epoch": 0.7843137254901961,
-      "grad_norm": 0.5550107955932617,
-      "learning_rate": 1.7324379477525086e-05,
-      "loss": 0.5044,
-      "step": 760
     },
     {
-      "epoch": 0.7946336429308566,
-      "grad_norm": 0.5430490374565125,
-      "learning_rate": 1.724805171127249e-05,
-      "loss": 0.5029,
-      "step": 770
     },
     {
-      "epoch": 0.804953560371517,
-      "grad_norm": 0.5498166680335999,
-      "learning_rate": 1.7170823771191824e-05,
-      "loss": 0.499,
-      "step": 780
     },
     {
-      "epoch": 0.8152734778121775,
-      "grad_norm": 0.5843333601951599,
-      "learning_rate": 1.709270524862891e-05,
-      "loss": 0.4968,
-      "step": 790
     },
     {
-      "epoch": 0.8255933952528379,
-      "grad_norm": 0.5710884928703308,
-      "learning_rate": 1.7013705845535704e-05,
-      "loss": 0.5024,
-      "step": 800
     },
     {
-      "epoch": 0.8359133126934984,
-      "grad_norm": 0.5185025930404663,
-      "learning_rate": 1.6933835373265373e-05,
-      "loss": 0.503,
-      "step": 810
     },
     {
-      "epoch": 0.846233230134159,
-      "grad_norm": 0.5252718329429626,
-      "learning_rate": 1.685310375135376e-05,
-      "loss": 0.5028,
-      "step": 820
     },
     {
-      "epoch": 0.8565531475748194,
-      "grad_norm": 0.5351059436798096,
-      "learning_rate": 1.6771521006287442e-05,
-      "loss": 0.4927,
-      "step": 830
     },
     {
-      "epoch": 0.8668730650154799,
-      "grad_norm": 0.5176792740821838,
-      "learning_rate": 1.6689097270258463e-05,
-      "loss": 0.5012,
-      "step": 840
     },
     {
-      "epoch": 0.8771929824561403,
-      "grad_norm": 0.5016619563102722,
-      "learning_rate": 1.6605842779905984e-05,
-      "loss": 0.4941,
-      "step": 850
     },
     {
-      "epoch": 0.8875128998968008,
-      "grad_norm": 0.536718487739563,
-      "learning_rate": 1.6521767875044935e-05,
-      "loss": 0.488,
-      "step": 860
     },
     {
-      "epoch": 0.8978328173374613,
-      "grad_norm": 0.49594587087631226,
-      "learning_rate": 1.643688299738186e-05,
-      "loss": 0.4901,
-      "step": 870
     },
     {
-      "epoch": 0.9081527347781218,
-      "grad_norm": 0.5281170606613159,
-      "learning_rate": 1.635119868921809e-05,
-      "loss": 0.4979,
-      "step": 880
     },
     {
-      "epoch": 0.9184726522187823,
-      "grad_norm": 0.5000081658363342,
-      "learning_rate": 1.6264725592140468e-05,
-      "loss": 0.4935,
-      "step": 890
     },
     {
-      "epoch": 0.9287925696594427,
-      "grad_norm": 0.5359088182449341,
-      "learning_rate": 1.6177474445699695e-05,
-      "loss": 0.4854,
-      "step": 900
     },
     {
-      "epoch": 0.9391124871001032,
-      "grad_norm": 0.5657668709754944,
-      "learning_rate": 1.6089456086076527e-05,
-      "loss": 0.4877,
-      "step": 910
     },
     {
-      "epoch": 0.9494324045407637,
-      "grad_norm": 0.507234513759613,
-      "learning_rate": 1.6000681444735976e-05,
-      "loss": 0.4903,
-      "step": 920
     },
     {
-      "epoch": 0.9597523219814241,
-      "grad_norm": 0.5578757524490356,
-      "learning_rate": 1.5911161547069688e-05,
-      "loss": 0.4884,
-      "step": 930
     },
     {
-      "epoch": 0.9700722394220846,
-      "grad_norm": 0.5635477304458618,
-      "learning_rate": 1.582090751102662e-05,
-      "loss": 0.4973,
-      "step": 940
     },
     {
-      "epoch": 0.9803921568627451,
-      "grad_norm": 0.5168154835700989,
-      "learning_rate": 1.5729930545732247e-05,
-      "loss": 0.4818,
-      "step": 950
     },
     {
-      "epoch": 0.9907120743034056,
-      "grad_norm": 0.5357134342193604,
-      "learning_rate": 1.5638241950096458e-05,
-      "loss": 0.4863,
-      "step": 960
     },
     {
-      "epoch": 1.001031991744066,
-      "grad_norm": 1.1038967370986938,
-      "learning_rate": 1.554585311141027e-05,
-      "loss": 0.4791,
-      "step": 970
     },
     {
-      "epoch": 1.0113519091847265,
-      "grad_norm": 0.6728698015213013,
-      "learning_rate": 1.5452775503931566e-05,
-      "loss": 0.4229,
-      "step": 980
     },
     {
-      "epoch": 1.021671826625387,
-      "grad_norm": 0.5582284331321716,
-      "learning_rate": 1.5359020687460096e-05,
-      "loss": 0.4193,
-      "step": 990
     },
     {
-      "epoch": 1.0319917440660475,
-      "grad_norm": 0.5344264507293701,
-      "learning_rate": 1.5264600305901744e-05,
-      "loss": 0.4241,
-      "step": 1000
     },
     {
-      "epoch": 1.0423116615067078,
-      "grad_norm": 0.5118332505226135,
-      "learning_rate": 1.5169526085822451e-05,
-      "loss": 0.4178,
-      "step": 1010
     },
     {
-      "epoch": 1.0526315789473684,
-      "grad_norm": 0.54106605052948,
-      "learning_rate": 1.5073809834991816e-05,
-      "loss": 0.4167,
-      "step": 1020
     },
     {
-      "epoch": 1.0629514963880289,
-      "grad_norm": 0.591042697429657,
-      "learning_rate": 1.4977463440916621e-05,
-      "loss": 0.4154,
-      "step": 1030
     },
     {
-      "epoch": 1.0732714138286894,
-      "grad_norm": 0.5546119809150696,
-      "learning_rate": 1.4880498869364482e-05,
-      "loss": 0.4211,
-      "step": 1040
     },
     {
-      "epoch": 1.08359133126935,
-      "grad_norm": 0.5102314352989197,
-      "learning_rate": 1.4782928162877722e-05,
-      "loss": 0.4187,
-      "step": 1050
     },
     {
-      "epoch": 1.0939112487100102,
-      "grad_norm": 0.5234063863754272,
-      "learning_rate": 1.468476343927778e-05,
-      "loss": 0.4177,
-      "step": 1060
     },
     {
-      "epoch": 1.1042311661506707,
-      "grad_norm": 0.5099871158599854,
-      "learning_rate": 1.4586016890160208e-05,
-      "loss": 0.4213,
-      "step": 1070
     },
     {
-      "epoch": 1.1145510835913313,
-      "grad_norm": 0.5453868508338928,
-      "learning_rate": 1.4486700779380547e-05,
-      "loss": 0.4192,
-      "step": 1080
     },
     {
-      "epoch": 1.1248710010319918,
-      "grad_norm": 0.5475857257843018,
-      "learning_rate": 1.4386827441531202e-05,
-      "loss": 0.4178,
-      "step": 1090
     },
     {
-      "epoch": 1.1351909184726523,
-      "grad_norm": 0.5636183619499207,
-      "learning_rate": 1.4286409280409558e-05,
-      "loss": 0.4167,
-      "step": 1100
     },
     {
-      "epoch": 1.1455108359133126,
-      "grad_norm": 0.5477967262268066,
-      "learning_rate": 1.4185458767477487e-05,
-      "loss": 0.4184,
-      "step": 1110
     },
     {
-      "epoch": 1.1558307533539731,
-      "grad_norm": 0.5478163361549377,
-      "learning_rate": 1.4083988440312429e-05,
-      "loss": 0.419,
-      "step": 1120
     },
     {
-      "epoch": 1.1661506707946336,
-      "grad_norm": 0.5689426064491272,
-      "learning_rate": 1.3982010901050305e-05,
-      "loss": 0.4239,
-      "step": 1130
     },
     {
-      "epoch": 1.1764705882352942,
-      "grad_norm": 0.5106656551361084,
-      "learning_rate": 1.3879538814820395e-05,
-      "loss": 0.4135,
-      "step": 1140
     },
     {
-      "epoch": 1.1867905056759547,
-      "grad_norm": 0.5251624584197998,
-      "learning_rate": 1.3776584908172364e-05,
-      "loss": 0.4202,
-      "step": 1150
     },
     {
-      "epoch": 1.197110423116615,
-      "grad_norm": 0.5535441040992737,
-      "learning_rate": 1.3673161967495708e-05,
-      "loss": 0.4181,
-      "step": 1160
     },
     {
-      "epoch": 1.2074303405572755,
-      "grad_norm": 0.5619220733642578,
-      "learning_rate": 1.3569282837431737e-05,
-      "loss": 0.4202,
-      "step": 1170
     },
     {
-      "epoch": 1.217750257997936,
-      "grad_norm": 0.5495029091835022,
-      "learning_rate": 1.3464960419278332e-05,
-      "loss": 0.4135,
-      "step": 1180
     },
     {
-      "epoch": 1.2280701754385965,
-      "grad_norm": 0.5409591197967529,
-      "learning_rate": 1.336020766938766e-05,
-      "loss": 0.4099,
-      "step": 1190
     },
     {
-      "epoch": 1.238390092879257,
-      "grad_norm": 0.5582126379013062,
-      "learning_rate": 1.3255037597557057e-05,
-      "loss": 0.4168,
-      "step": 1200
     },
     {
-      "epoch": 1.2487100103199174,
-      "grad_norm": 0.5315924882888794,
-      "learning_rate": 1.3149463265413282e-05,
-      "loss": 0.4163,
-      "step": 1210
     },
     {
-      "epoch": 1.2590299277605779,
-      "grad_norm": 0.5000606775283813,
-      "learning_rate": 1.3043497784790315e-05,
-      "loss": 0.4155,
-      "step": 1220
     },
     {
-      "epoch": 1.2693498452012384,
-      "grad_norm": 0.5188019275665283,
-      "learning_rate": 1.2937154316100927e-05,
-      "loss": 0.4155,
-      "step": 1230
     },
     {
-      "epoch": 1.279669762641899,
-      "grad_norm": 0.5054394006729126,
-      "learning_rate": 1.283044606670223e-05,
-      "loss": 0.4079,
-      "step": 1240
     },
     {
-      "epoch": 1.2899896800825594,
-      "grad_norm": 0.5096462368965149,
-      "learning_rate": 1.2723386289255374e-05,
-      "loss": 0.4149,
-      "step": 1250
     },
     {
-      "epoch": 1.3003095975232197,
-      "grad_norm": 0.5191652178764343,
-      "learning_rate": 1.2615988280079645e-05,
-      "loss": 0.4103,
-      "step": 1260
     },
     {
-      "epoch": 1.3106295149638802,
-      "grad_norm": 0.4963880777359009,
-      "learning_rate": 1.2508265377501102e-05,
-      "loss": 0.4117,
-      "step": 1270
     },
     {
-      "epoch": 1.3209494324045408,
-      "grad_norm": 0.5644184947013855,
-      "learning_rate": 1.240023096019603e-05,
-      "loss": 0.4139,
-      "step": 1280
     },
     {
-      "epoch": 1.3312693498452013,
-      "grad_norm": 0.521536111831665,
-      "learning_rate": 1.2291898445529384e-05,
-      "loss": 0.4107,
-      "step": 1290
     },
     {
-      "epoch": 1.3415892672858618,
-      "grad_norm": 0.5256720781326294,
-      "learning_rate": 1.2183281287888398e-05,
-      "loss": 0.4104,
-      "step": 1300
     },
     {
-      "epoch": 1.351909184726522,
-      "grad_norm": 0.531589686870575,
-      "learning_rate": 1.2074392977011629e-05,
-      "loss": 0.4111,
-      "step": 1310
     },
     {
-      "epoch": 1.3622291021671826,
-      "grad_norm": 0.534598171710968,
-      "learning_rate": 1.1965247036313573e-05,
-      "loss": 0.416,
-      "step": 1320
     },
     {
-      "epoch": 1.3725490196078431,
-      "grad_norm": 0.5281124711036682,
-      "learning_rate": 1.185585702120515e-05,
-      "loss": 0.4041,
-      "step": 1330
     },
     {
-      "epoch": 1.3828689370485037,
-      "grad_norm": 0.5332800149917603,
-      "learning_rate": 1.1746236517410155e-05,
-      "loss": 0.4076,
-      "step": 1340
     },
     {
-      "epoch": 1.3931888544891642,
-      "grad_norm": 0.4961317181587219,
-      "learning_rate": 1.1636399139277998e-05,
-      "loss": 0.4067,
-      "step": 1350
     },
     {
-      "epoch": 1.4035087719298245,
-      "grad_norm": 0.5210182070732117,
-      "learning_rate": 1.1526358528092861e-05,
-      "loss": 0.4071,
-      "step": 1360
     },
     {
-      "epoch": 1.413828689370485,
-      "grad_norm": 0.518181324005127,
-      "learning_rate": 1.1416128350379503e-05,
-      "loss": 0.4118,
-      "step": 1370
     },
     {
-      "epoch": 1.4241486068111455,
-      "grad_norm": 0.5396980047225952,
-      "learning_rate": 1.1305722296205968e-05,
-      "loss": 0.4073,
-      "step": 1380
     },
     {
-      "epoch": 1.434468524251806,
-      "grad_norm": 0.5073665976524353,
-      "learning_rate": 1.1195154077483313e-05,
-      "loss": 0.4083,
-      "step": 1390
     },
     {
-      "epoch": 1.4447884416924666,
-      "grad_norm": 0.5103346705436707,
-      "learning_rate": 1.1084437426262666e-05,
-      "loss": 0.4094,
-      "step": 1400
     },
     {
-      "epoch": 1.4551083591331269,
-      "grad_norm": 0.5441737174987793,
-      "learning_rate": 1.097358609302978e-05,
-      "loss": 0.4124,
-      "step": 1410
     },
     {
-      "epoch": 1.4654282765737874,
-      "grad_norm": 0.49091413617134094,
-      "learning_rate": 1.0862613844997272e-05,
-      "loss": 0.4059,
-      "step": 1420
     },
     {
-      "epoch": 1.475748194014448,
-      "grad_norm": 0.49451103806495667,
-      "learning_rate": 1.0751534464394809e-05,
-      "loss": 0.4028,
-      "step": 1430
     },
     {
-      "epoch": 1.4860681114551084,
-      "grad_norm": 0.5205165147781372,
-      "learning_rate": 1.0640361746757413e-05,
-      "loss": 0.4038,
-      "step": 1440
     },
     {
-      "epoch": 1.496388028895769,
-      "grad_norm": 0.5233325958251953,
-      "learning_rate": 1.0529109499212137e-05,
-      "loss": 0.4097,
-      "step": 1450
     },
     {
-      "epoch": 1.5067079463364292,
-      "grad_norm": 0.5237818956375122,
-      "learning_rate": 1.0417791538763269e-05,
-      "loss": 0.4059,
-      "step": 1460
     },
     {
-      "epoch": 1.5170278637770898,
-      "grad_norm": 0.5263275504112244,
-      "learning_rate": 1.0306421690576318e-05,
-      "loss": 0.4074,
-      "step": 1470
     },
     {
-      "epoch": 1.5273477812177503,
-      "grad_norm": 0.5042173862457275,
-      "learning_rate": 1.0195013786261017e-05,
-      "loss": 0.4061,
-      "step": 1480
     },
     {
-      "epoch": 1.5376676986584106,
-      "grad_norm": 0.48727792501449585,
-      "learning_rate": 1.0083581662153488e-05,
-      "loss": 0.4021,
-      "step": 1490
     },
     {
-      "epoch": 1.5479876160990713,
-      "grad_norm": 0.5014871954917908,
-      "learning_rate": 9.972139157597836e-06,
-      "loss": 0.411,
-      "step": 1500
     },
     {
-      "epoch": 1.5583075335397316,
-      "grad_norm": 0.49665823578834534,
-      "learning_rate": 9.86070011322737e-06,
-      "loss": 0.4069,
-      "step": 1510
     },
     {
-      "epoch": 1.5686274509803921,
-      "grad_norm": 0.48189592361450195,
-      "learning_rate": 9.749278369245658e-06,
-      "loss": 0.4055,
-      "step": 1520
     },
     {
-      "epoch": 1.5789473684210527,
-      "grad_norm": 0.5003267526626587,
-      "learning_rate": 9.637887763707649e-06,
-      "loss": 0.4023,
-      "step": 1530
     },
     {
-      "epoch": 1.589267285861713,
-      "grad_norm": 0.4762038290500641,
-      "learning_rate": 9.52654213080103e-06,
-      "loss": 0.4063,
-      "step": 1540
     },
     {
-      "epoch": 1.5995872033023737,
-      "grad_norm": 0.48036977648735046,
-      "learning_rate": 9.415255299128115e-06,
-      "loss": 0.3991,
-      "step": 1550
     },
     {
-      "epoch": 1.609907120743034,
-      "grad_norm": 1.7054091691970825,
-      "learning_rate": 9.304041089988367e-06,
-      "loss": 0.4099,
-      "step": 1560
     },
     {
-      "epoch": 1.6202270381836945,
-      "grad_norm": 0.5128041505813599,
-      "learning_rate": 9.192913315661887e-06,
-      "loss": 0.4093,
-      "step": 1570
     },
     {
-      "epoch": 1.630546955624355,
-      "grad_norm": 0.5168408751487732,
-      "learning_rate": 9.081885777693969e-06,
-      "loss": 0.4012,
-      "step": 1580
     },
     {
-      "epoch": 1.6408668730650153,
-      "grad_norm": 0.4789281189441681,
-      "learning_rate": 8.97097226518103e-06,
-      "loss": 0.4024,
-      "step": 1590
     },
     {
-      "epoch": 1.651186790505676,
-      "grad_norm": 0.4675295650959015,
-      "learning_rate": 8.860186553058066e-06,
-      "loss": 0.3992,
-      "step": 1600
     },
     {
-      "epoch": 1.6615067079463364,
-      "grad_norm": 0.4954163730144501,
-      "learning_rate": 8.749542400387861e-06,
-      "loss": 0.3986,
-      "step": 1610
     },
     {
-      "epoch": 1.671826625386997,
-      "grad_norm": 0.4895382523536682,
-      "learning_rate": 8.639053548652183e-06,
-      "loss": 0.3949,
-      "step": 1620
     },
     {
-      "epoch": 1.6821465428276574,
-      "grad_norm": 0.49679800868034363,
-      "learning_rate": 8.528733720045162e-06,
-      "loss": 0.4042,
-      "step": 1630
     },
     {
-      "epoch": 1.6924664602683177,
-      "grad_norm": 0.470292866230011,
-      "learning_rate": 8.418596615769048e-06,
-      "loss": 0.3977,
-      "step": 1640
     },
     {
-      "epoch": 1.7027863777089784,
-      "grad_norm": 0.46729475259780884,
-      "learning_rate": 8.308655914332599e-06,
-      "loss": 0.4022,
-      "step": 1650
     },
     {
-      "epoch": 1.7131062951496387,
-      "grad_norm": 0.49843648076057434,
-      "learning_rate": 8.198925269852251e-06,
-      "loss": 0.3953,
-      "step": 1660
     },
     {
-      "epoch": 1.7234262125902993,
-      "grad_norm": 0.4577590227127075,
-      "learning_rate": 8.089418310356379e-06,
-      "loss": 0.398,
-      "step": 1670
     },
     {
-      "epoch": 1.7337461300309598,
-      "grad_norm": 0.45520010590553284,
-      "learning_rate": 7.980148636092719e-06,
-      "loss": 0.3986,
-      "step": 1680
     },
     {
-      "epoch": 1.74406604747162,
-      "grad_norm": 0.48741379380226135,
-      "learning_rate": 7.871129817839304e-06,
-      "loss": 0.3926,
-      "step": 1690
     },
     {
-      "epoch": 1.7543859649122808,
-      "grad_norm": 0.47943034768104553,
-      "learning_rate": 7.762375395219045e-06,
-      "loss": 0.403,
-      "step": 1700
     },
     {
-      "epoch": 1.7647058823529411,
-      "grad_norm": 0.4822390675544739,
-      "learning_rate": 7.653898875018151e-06,
-      "loss": 0.3967,
-      "step": 1710
     },
     {
-      "epoch": 1.7750257997936016,
-      "grad_norm": 0.47492411732673645,
-      "learning_rate": 7.545713729508673e-06,
-      "loss": 0.3955,
-      "step": 1720
     },
     {
-      "epoch": 1.7853457172342622,
-      "grad_norm": 0.48685282468795776,
-      "learning_rate": 7.437833394775283e-06,
-      "loss": 0.3974,
-      "step": 1730
     },
     {
-      "epoch": 1.7956656346749225,
-      "grad_norm": 0.47495120763778687,
-      "learning_rate": 7.330271269046614e-06,
-      "loss": 0.3997,
-      "step": 1740
     },
     {
-      "epoch": 1.8059855521155832,
-      "grad_norm": 0.4861559271812439,
-      "learning_rate": 7.223040711031225e-06,
-      "loss": 0.3972,
-      "step": 1750
     },
     {
-      "epoch": 1.8163054695562435,
-      "grad_norm": 0.4717768728733063,
-      "learning_rate": 7.116155038258531e-06,
-      "loss": 0.3963,
-      "step": 1760
     },
     {
-      "epoch": 1.826625386996904,
-      "grad_norm": 0.47078821063041687,
-      "learning_rate": 7.009627525424836e-06,
-      "loss": 0.3962,
-      "step": 1770
     },
     {
-      "epoch": 1.8369453044375645,
-      "grad_norm": 0.4606710374355316,
-      "learning_rate": 6.903471402744662e-06,
-      "loss": 0.3929,
-      "step": 1780
     },
     {
-      "epoch": 1.8472652218782248,
-      "grad_norm": 0.45694735646247864,
-      "learning_rate": 6.797699854307631e-06,
-      "loss": 0.3897,
-      "step": 1790
     },
     {
-      "epoch": 1.8575851393188856,
-      "grad_norm": 0.4747222661972046,
-      "learning_rate": 6.692326016441054e-06,
-      "loss": 0.3904,
-      "step": 1800
     }
   ],
-  "logging_steps": 10,
-  "max_steps": 2907,
   "num_input_tokens_seen": 0,
-  "num_train_epochs": 3,
   "save_steps": 200,
   "stateful_callbacks": {
     "TrainerControl": {
@@ -1286,8 +2406,8 @@
       "attributes": {}
     }
   },
-  "total_flos": 8.247296645602371e+19,
-  "train_batch_size": 2,
   "trial_name": null,
   "trial_params": null
 }

 {
   "best_metric": null,
   "best_model_checkpoint": null,
+  "epoch": 1.9293516810895164,
   "eval_steps": 500,
+  "global_step": 6800,
   "is_hyper_param_search": false,
   "is_local_process_zero": true,
   "is_world_process_zero": true,
   "log_history": [
     {
+      "epoch": 0.005674563767910342,
+      "grad_norm": 1.8945719003677368,
+      "learning_rate": 2.830188679245283e-06,
+      "loss": 0.9878,
+      "step": 20
     },
     {
+      "epoch": 0.011349127535820683,
+      "grad_norm": 0.8699278235435486,
+      "learning_rate": 5.660377358490566e-06,
+      "loss": 0.9338,
+      "step": 40
     },
     {
+      "epoch": 0.017023691303731027,
+      "grad_norm": 0.9612842798233032,
+      "learning_rate": 8.49056603773585e-06,
+      "loss": 0.8992,
+      "step": 60
     },
     {
+      "epoch": 0.022698255071641367,
+      "grad_norm": 1.0209581851959229,
+      "learning_rate": 1.1320754716981132e-05,
+      "loss": 0.8802,
+      "step": 80
     },
     {
+      "epoch": 0.02837281883955171,
+      "grad_norm": 1.1397087574005127,
+      "learning_rate": 1.4150943396226415e-05,
+      "loss": 0.8636,
+      "step": 100
     },
     {
+      "epoch": 0.034047382607462054,
+      "grad_norm": 1.0688011646270752,
+      "learning_rate": 1.69811320754717e-05,
+      "loss": 0.8589,
+      "step": 120
     },
     {
+      "epoch": 0.039721946375372394,
+      "grad_norm": 1.0701323747634888,
+      "learning_rate": 1.981132075471698e-05,
+      "loss": 0.8445,
+      "step": 140
     },
     {
+      "epoch": 0.045396510143282734,
+      "grad_norm": 1.0749995708465576,
+      "learning_rate": 2.2641509433962265e-05,
+      "loss": 0.8438,
+      "step": 160
     },
     {
+      "epoch": 0.051071073911193074,
+      "grad_norm": 1.2973322868347168,
+      "learning_rate": 2.547169811320755e-05,
+      "loss": 0.8399,
+      "step": 180
     },
     {
+      "epoch": 0.05674563767910342,
+      "grad_norm": 0.9941120743751526,
+      "learning_rate": 2.830188679245283e-05,
+      "loss": 0.8459,
+      "step": 200
     },
     {
+      "epoch": 0.06242020144701376,
+      "grad_norm": 1.1092499494552612,
+      "learning_rate": 2.9999898623711896e-05,
+      "loss": 0.8396,
+      "step": 220
     },
     {
+      "epoch": 0.06809476521492411,
+      "grad_norm": 1.10667085647583,
+      "learning_rate": 2.999875815620755e-05,
+      "loss": 0.8403,
+      "step": 240
     },
     {
+      "epoch": 0.07376932898283445,
+      "grad_norm": 1.0986227989196777,
+      "learning_rate": 2.999635059750628e-05,
+      "loss": 0.8296,
+      "step": 260
     },
     {
+      "epoch": 0.07944389275074479,
+      "grad_norm": 0.9648028612136841,
+      "learning_rate": 2.9992676150998032e-05,
+      "loss": 0.8187,
+      "step": 280
     },
     {
+      "epoch": 0.08511845651865513,
+      "grad_norm": 0.8029258251190186,
+      "learning_rate": 2.998773512709909e-05,
+      "loss": 0.8224,
+      "step": 300
     },
     {
+      "epoch": 0.09079302028656547,
+      "grad_norm": 0.888502299785614,
+      "learning_rate": 2.9981527943225862e-05,
+      "loss": 0.8178,
+      "step": 320
     },
     {
+      "epoch": 0.09646758405447581,
+      "grad_norm": 0.7894881963729858,
+      "learning_rate": 2.997405512375964e-05,
+      "loss": 0.8153,
+      "step": 340
     },
     {
+      "epoch": 0.10214214782238615,
+      "grad_norm": 0.8492247462272644,
+      "learning_rate": 2.996531730000227e-05,
+      "loss": 0.8105,
+      "step": 360
     },
     {
+      "epoch": 0.1078167115902965,
+      "grad_norm": 0.8247759938240051,
+      "learning_rate": 2.9955315210122842e-05,
+      "loss": 0.8,
+      "step": 380
     },
     {
+      "epoch": 0.11349127535820684,
+      "grad_norm": 0.8270812034606934,
+      "learning_rate": 2.99440496990953e-05,
+      "loss": 0.802,
+      "step": 400
     },
     {
+      "epoch": 0.11916583912611718,
+      "grad_norm": 0.8336136937141418,
+      "learning_rate": 2.9931521718627107e-05,
+      "loss": 0.7932,
+      "step": 420
     },
     {
+      "epoch": 0.12484040289402752,
+      "grad_norm": 0.7927630543708801,
+      "learning_rate": 2.991773232707879e-05,
+      "loss": 0.7903,
+      "step": 440
     },
     {
+      "epoch": 0.13051496666193788,
+      "grad_norm": 0.8075955510139465,
+      "learning_rate": 2.9902682689374578e-05,
+      "loss": 0.7897,
+      "step": 460
     },
     {
+      "epoch": 0.13618953042984822,
+      "grad_norm": 0.7381598353385925,
+      "learning_rate": 2.9886374076903945e-05,
+      "loss": 0.785,
+      "step": 480
     },
     {
+      "epoch": 0.14186409419775856,
+      "grad_norm": 0.799022912979126,
+      "learning_rate": 2.986880786741426e-05,
+      "loss": 0.7862,
+      "step": 500
     },
     {
+      "epoch": 0.1475386579656689,
+      "grad_norm": 0.7515665292739868,
+      "learning_rate": 2.9849985544894333e-05,
+      "loss": 0.7845,
+      "step": 520
     },
     {
+      "epoch": 0.15321322173357924,
+      "grad_norm": 0.8161646723747253,
+      "learning_rate": 2.982990869944908e-05,
+      "loss": 0.7745,
+      "step": 540
     },
     {
+      "epoch": 0.15888778550148958,
+      "grad_norm": 0.671816885471344,
+      "learning_rate": 2.9808579027165204e-05,
+      "loss": 0.7786,
+      "step": 560
     },
     {
+      "epoch": 0.16456234926939992,
+      "grad_norm": 0.7310769557952881,
+      "learning_rate": 2.978599832996788e-05,
+      "loss": 0.7742,
+      "step": 580
     },
     {
+      "epoch": 0.17023691303731026,
+      "grad_norm": 0.7568747401237488,
+      "learning_rate": 2.9762168515468548e-05,
+      "loss": 0.7691,
+      "step": 600
     },
     {
+      "epoch": 0.1759114768052206,
+      "grad_norm": 0.6345218420028687,
+      "learning_rate": 2.973709159680375e-05,
+      "loss": 0.7695,
+      "step": 620
     },
     {
+      "epoch": 0.18158604057313094,
+      "grad_norm": 0.7218050360679626,
+      "learning_rate": 2.9710769692465073e-05,
+      "loss": 0.7681,
+      "step": 640
     },
     {
+      "epoch": 0.18726060434104128,
+      "grad_norm": 0.7665095925331116,
+      "learning_rate": 2.9683205026120163e-05,
+      "loss": 0.7667,
+      "step": 660
     },
     {
+      "epoch": 0.19293516810895162,
+      "grad_norm": 0.6717973947525024,
+      "learning_rate": 2.9654399926424884e-05,
+      "loss": 0.7684,
+      "step": 680
     },
     {
+      "epoch": 0.19860973187686196,
+      "grad_norm": 0.7454754114151001,
+      "learning_rate": 2.9624356826826577e-05,
+      "loss": 0.7622,
+      "step": 700
     },
     {
+      "epoch": 0.2042842956447723,
+      "grad_norm": 0.6865426898002625,
+      "learning_rate": 2.9593078265358498e-05,
+      "loss": 0.761,
+      "step": 720
     },
     {
+      "epoch": 0.20995885941268266,
+      "grad_norm": 0.7075285315513611,
+      "learning_rate": 2.956056688442541e-05,
+      "loss": 0.7578,
+      "step": 740
     },
     {
+      "epoch": 0.215633423180593,
+      "grad_norm": 0.7438149452209473,
+      "learning_rate": 2.9526825430580337e-05,
+      "loss": 0.7571,
+      "step": 760
     },
     {
+      "epoch": 0.22130798694850334,
+      "grad_norm": 0.6830400228500366,
+      "learning_rate": 2.949185675429254e-05,
+      "loss": 0.759,
+      "step": 780
     },
     {
+      "epoch": 0.22698255071641368,
+      "grad_norm": 0.7147162556648254,
+      "learning_rate": 2.9455663809706725e-05,
+      "loss": 0.756,
+      "step": 800
     },
     {
+      "epoch": 0.23265711448432402,
+      "grad_norm": 0.7116013765335083,
+      "learning_rate": 2.9418249654393443e-05,
+      "loss": 0.7538,
+      "step": 820
     },
     {
+      "epoch": 0.23833167825223436,
+      "grad_norm": 0.64736407995224,
+      "learning_rate": 2.9379617449090847e-05,
+      "loss": 0.7513,
+      "step": 840
     },
     {
+      "epoch": 0.2440062420201447,
+      "grad_norm": 0.6453843116760254,
+      "learning_rate": 2.93397704574376e-05,
+      "loss": 0.7538,
+      "step": 860
     },
     {
+      "epoch": 0.24968080578805504,
+      "grad_norm": 0.6253499388694763,
+      "learning_rate": 2.929871204569722e-05,
+      "loss": 0.7463,
+      "step": 880
     },
     {
+      "epoch": 0.2553553695559654,
+      "grad_norm": 0.6677010655403137,
+      "learning_rate": 2.9256445682473683e-05,
+      "loss": 0.7419,
+      "step": 900
     },
     {
+      "epoch": 0.26102993332387575,
+      "grad_norm": 0.7070403695106506,
+      "learning_rate": 2.9212974938418385e-05,
+      "loss": 0.7449,
+      "step": 920
     },
     {
+      "epoch": 0.26670449709178606,
+      "grad_norm": 0.6784743070602417,
+      "learning_rate": 2.9168303485928495e-05,
+      "loss": 0.7453,
+      "step": 940
     },
     {
+      "epoch": 0.27237906085969643,
+      "grad_norm": 0.6076740026473999,
+      "learning_rate": 2.912243509883673e-05,
+      "loss": 0.7457,
+      "step": 960
     },
     {
+      "epoch": 0.27805362462760674,
+      "grad_norm": 0.6722409129142761,
+      "learning_rate": 2.9075373652092535e-05,
+      "loss": 0.7373,
+      "step": 980
     },
     {
+      "epoch": 0.2837281883955171,
+      "grad_norm": 0.7188818454742432,
+      "learning_rate": 2.9027123121434714e-05,
+      "loss": 0.7343,
+      "step": 1000
     },
     {
+      "epoch": 0.2894027521634274,
+      "grad_norm": 0.657289981842041,
+      "learning_rate": 2.897768758305558e-05,
+      "loss": 0.7336,
+      "step": 1020
     },
     {
+      "epoch": 0.2950773159313378,
+      "grad_norm": 0.6076385378837585,
+      "learning_rate": 2.892707121325658e-05,
+      "loss": 0.7331,
+      "step": 1040
     },
     {
+      "epoch": 0.3007518796992481,
+      "grad_norm": 0.6217896342277527,
+      "learning_rate": 2.8875278288095507e-05,
+      "loss": 0.7339,
+      "step": 1060
     },
     {
+      "epoch": 0.30642644346715847,
+      "grad_norm": 0.6453694701194763,
+      "learning_rate": 2.882231318302523e-05,
+      "loss": 0.7334,
+      "step": 1080
     },
     {
+      "epoch": 0.3121010072350688,
+      "grad_norm": 0.6069263219833374,
+      "learning_rate": 2.8768180372524093e-05,
+      "loss": 0.734,
+      "step": 1100
     },
     {
+      "epoch": 0.31777557100297915,
+      "grad_norm": 0.6342785358428955,
+      "learning_rate": 2.8712884429717873e-05,
+      "loss": 0.7254,
+      "step": 1120
     },
     {
+      "epoch": 0.32345013477088946,
+      "grad_norm": 0.5936433672904968,
+      "learning_rate": 2.8656430025993464e-05,
+      "loss": 0.7232,
+      "step": 1140
     },
     {
+      "epoch": 0.32912469853879983,
+      "grad_norm": 0.5988269448280334,
+      "learning_rate": 2.8598821930604252e-05,
+      "loss": 0.726,
+      "step": 1160
     },
     {
+      "epoch": 0.3347992623067102,
+      "grad_norm": 0.6247944235801697,
+      "learning_rate": 2.8540065010267183e-05,
+      "loss": 0.729,
+      "step": 1180
     },
     {
+      "epoch": 0.3404738260746205,
+      "grad_norm": 0.6017037034034729,
+      "learning_rate": 2.848016422875164e-05,
+      "loss": 0.7216,
+      "step": 1200
     },
     {
+      "epoch": 0.3461483898425309,
+      "grad_norm": 0.7368952631950378,
+      "learning_rate": 2.84191246464601e-05,
+      "loss": 0.7331,
+      "step": 1220
     },
     {
+      "epoch": 0.3518229536104412,
+      "grad_norm": 0.6655734777450562,
+      "learning_rate": 2.835695142000064e-05,
+      "loss": 0.7233,
+      "step": 1240
     },
     {
+      "epoch": 0.35749751737835156,
+      "grad_norm": 0.6325275301933289,
+      "learning_rate": 2.8293649801751288e-05,
+      "loss": 0.7208,
+      "step": 1260
     },
     {
+      "epoch": 0.36317208114626187,
+      "grad_norm": 0.6046157479286194,
+      "learning_rate": 2.822922513941634e-05,
+      "loss": 0.7156,
+      "step": 1280
     },
     {
+      "epoch": 0.36884664491417224,
+      "grad_norm": 0.6081031560897827,
+      "learning_rate": 2.816368287557454e-05,
+      "loss": 0.722,
+      "step": 1300
     },
     {
+      "epoch": 0.37452120868208255,
+      "grad_norm": 0.6153631806373596,
+      "learning_rate": 2.809702854721934e-05,
+      "loss": 0.7171,
+      "step": 1320
     },
     {
+      "epoch": 0.3801957724499929,
+      "grad_norm": 0.6361656188964844,
+      "learning_rate": 2.8029267785291092e-05,
+      "loss": 0.7134,
+      "step": 1340
     },
     {
+      "epoch": 0.38587033621790323,
+      "grad_norm": 0.6033869981765747,
+      "learning_rate": 2.796040631420139e-05,
+      "loss": 0.7171,
+      "step": 1360
     },
     {
+      "epoch": 0.3915448999858136,
+      "grad_norm": 0.6300106644630432,
+      "learning_rate": 2.789044995134944e-05,
+      "loss": 0.7139,
+      "step": 1380
     },
     {
+      "epoch": 0.3972194637537239,
+      "grad_norm": 0.5989068150520325,
+      "learning_rate": 2.781940460663062e-05,
+      "loss": 0.7142,
+      "step": 1400
     },
     {
+      "epoch": 0.4028940275216343,
+      "grad_norm": 0.5790150761604309,
+      "learning_rate": 2.774727628193721e-05,
+      "loss": 0.7126,
+      "step": 1420
     },
     {
+      "epoch": 0.4085685912895446,
+      "grad_norm": 0.5948804616928101,
+      "learning_rate": 2.7674071070651378e-05,
+      "loss": 0.7103,
+      "step": 1440
     },
     {
+      "epoch": 0.41424315505745496,
+      "grad_norm": 0.6838712096214294,
+      "learning_rate": 2.7599795157130364e-05,
+      "loss": 0.7169,
+      "step": 1460
     },
     {
+      "epoch": 0.4199177188253653,
+      "grad_norm": 0.6502018570899963,
+      "learning_rate": 2.7524454816184076e-05,
+      "loss": 0.7094,
+      "step": 1480
     },
     {
+      "epoch": 0.42559228259327564,
+      "grad_norm": 0.6322967410087585,
+      "learning_rate": 2.7448056412544956e-05,
+      "loss": 0.7134,
+      "step": 1500
     },
     {
+      "epoch": 0.431266846361186,
+      "grad_norm": 0.5761287212371826,
+      "learning_rate": 2.7370606400330334e-05,
+      "loss": 0.7067,
+      "step": 1520
     },
     {
+      "epoch": 0.4369414101290963,
+      "grad_norm": 0.6147580742835999,
+      "learning_rate": 2.729211132249713e-05,
+      "loss": 0.7078,
+      "step": 1540
     },
     {
+      "epoch": 0.4426159738970067,
+      "grad_norm": 0.6231666207313538,
+      "learning_rate": 2.7212577810289157e-05,
+      "loss": 0.7066,
+      "step": 1560
     },
     {
+      "epoch": 0.448290537664917,
+      "grad_norm": 0.5739862322807312,
+      "learning_rate": 2.713201258267689e-05,
+      "loss": 0.708,
+      "step": 1580
     },
     {
+      "epoch": 0.45396510143282737,
+      "grad_norm": 0.7059602737426758,
+      "learning_rate": 2.7050422445789843e-05,
+      "loss": 0.7043,
+      "step": 1600
     },
     {
+      "epoch": 0.4596396652007377,
+      "grad_norm": 0.6156895160675049,
+      "learning_rate": 2.696781429234162e-05,
+      "loss": 0.7118,
+      "step": 1620
     },
     {
+      "epoch": 0.46531422896864805,
+      "grad_norm": 0.5444714426994324,
+      "learning_rate": 2.6884195101047567e-05,
+      "loss": 0.7031,
+      "step": 1640
     },
     {
+      "epoch": 0.47098879273655836,
+      "grad_norm": 0.6431369185447693,
+      "learning_rate": 2.6799571936035284e-05,
+      "loss": 0.7056,
+      "step": 1660
     },
     {
+      "epoch": 0.4766633565044687,
+      "grad_norm": 0.6375367641448975,
+      "learning_rate": 2.671395194624779e-05,
+      "loss": 0.6991,
+      "step": 1680
     },
     {
+      "epoch": 0.48233792027237904,
+      "grad_norm": 0.6311667561531067,
+      "learning_rate": 2.6627342364839604e-05,
+      "loss": 0.6991,
+      "step": 1700
     },
     {
+      "epoch": 0.4880124840402894,
+      "grad_norm": 0.580328643321991,
+      "learning_rate": 2.6539750508565683e-05,
+      "loss": 0.7027,
+      "step": 1720
     },
     {
+      "epoch": 0.4936870478081997,
+      "grad_norm": 0.6254743933677673,
+      "learning_rate": 2.6451183777163316e-05,
+      "loss": 0.6977,
+      "step": 1740
     },
     {
+      "epoch": 0.4993616115761101,
+      "grad_norm": 0.8747753500938416,
+      "learning_rate": 2.636164965272699e-05,
+      "loss": 0.6974,
+      "step": 1760
     },
     {
+      "epoch": 0.5050361753440205,
+      "grad_norm": 0.5931680798530579,
+      "learning_rate": 2.6271155699076305e-05,
+      "loss": 0.7001,
+      "step": 1780
     },
     {
+      "epoch": 0.5107107391119308,
+      "grad_norm": 0.5763223767280579,
+      "learning_rate": 2.6179709561116983e-05,
+      "loss": 0.7023,
+      "step": 1800
     },
     {
+      "epoch": 0.5163853028798411,
+      "grad_norm": 0.5211492776870728,
+      "learning_rate": 2.6087318964195032e-05,
+      "loss": 0.6957,
+      "step": 1820
     },
     {
+      "epoch": 0.5220598666477515,
+      "grad_norm": 0.5684000253677368,
+      "learning_rate": 2.59939917134441e-05,
+      "loss": 0.6916,
+      "step": 1840
     },
     {
+      "epoch": 0.5277344304156618,
+      "grad_norm": 0.6029589176177979,
+      "learning_rate": 2.5899735693126113e-05,
+      "loss": 0.6942,
+      "step": 1860
     },
     {
+      "epoch": 0.5334089941835721,
+      "grad_norm": 0.5765926837921143,
+      "learning_rate": 2.5804558865965206e-05,
+      "loss": 0.6973,
+      "step": 1880
     },
     {
+      "epoch": 0.5390835579514824,
+      "grad_norm": 0.5227144956588745,
+      "learning_rate": 2.5708469272475044e-05,
+      "loss": 0.6929,
+      "step": 1900
     },
     {
+      "epoch": 0.5447581217193929,
+      "grad_norm": 0.6175386309623718,
+      "learning_rate": 2.5611475030279546e-05,
+      "loss": 0.6908,
+      "step": 1920
     },
     {
+      "epoch": 0.5504326854873032,
+      "grad_norm": 0.5724866986274719,
+      "learning_rate": 2.5513584333427125e-05,
+      "loss": 0.6893,
+      "step": 1940
     },
     {
+      "epoch": 0.5561072492552135,
+      "grad_norm": 0.5964395403862,
+      "learning_rate": 2.541480545169846e-05,
+      "loss": 0.6944,
+      "step": 1960
     },
     {
+      "epoch": 0.5617818130231238,
+      "grad_norm": 0.6019209027290344,
+      "learning_rate": 2.5315146729907827e-05,
+      "loss": 0.6899,
+      "step": 1980
     },
     {
+      "epoch": 0.5674563767910342,
+      "grad_norm": 0.6371375918388367,
+      "learning_rate": 2.521461658719819e-05,
+      "loss": 0.6904,
+      "step": 2000
     },
     {
+      "epoch": 0.5731309405589445,
+      "grad_norm": 0.5762882232666016,
+      "learning_rate": 2.5113223516329924e-05,
+      "loss": 0.6887,
+      "step": 2020
     },
     {
+      "epoch": 0.5788055043268548,
+      "grad_norm": 0.591663122177124,
+      "learning_rate": 2.501097608296334e-05,
+      "loss": 0.6894,
+      "step": 2040
     },
     {
+      "epoch": 0.5844800680947652,
+      "grad_norm": 0.5833630561828613,
+      "learning_rate": 2.4907882924935072e-05,
+      "loss": 0.6866,
+      "step": 2060
     },
     {
+      "epoch": 0.5901546318626756,
+      "grad_norm": 0.5615355968475342,
+      "learning_rate": 2.4803952751528363e-05,
+      "loss": 0.6927,
+      "step": 2080
     },
     {
+      "epoch": 0.5958291956305859,
+      "grad_norm": 0.5507014989852905,
+      "learning_rate": 2.4699194342737295e-05,
+      "loss": 0.6934,
+      "step": 2100
     },
     {
+      "epoch": 0.6015037593984962,
+      "grad_norm": 0.5132161974906921,
+      "learning_rate": 2.459361654852505e-05,
+      "loss": 0.688,
+      "step": 2120
     },
     {
+      "epoch": 0.6071783231664066,
+      "grad_norm": 0.5238850116729736,
+      "learning_rate": 2.4487228288076293e-05,
+      "loss": 0.6804,
+      "step": 2140
     },
     {
+      "epoch": 0.6128528869343169,
+      "grad_norm": 0.5849164724349976,
+      "learning_rate": 2.438003854904366e-05,
+      "loss": 0.6911,
+      "step": 2160
     },
     {
+      "epoch": 0.6185274507022273,
+      "grad_norm": 0.5290674567222595,
+      "learning_rate": 2.4272056386788485e-05,
+      "loss": 0.6838,
+      "step": 2180
     },
     {
+      "epoch": 0.6242020144701376,
+      "grad_norm": 0.5804121494293213,
+      "learning_rate": 2.4163290923615814e-05,
+      "loss": 0.6894,
+      "step": 2200
     },
     {
+      "epoch": 0.629876578238048,
+      "grad_norm": 0.5559779405593872,
+      "learning_rate": 2.4053751348003757e-05,
+      "loss": 0.6859,
+      "step": 2220
     },
     {
+      "epoch": 0.6355511420059583,
+      "grad_norm": 0.5486791133880615,
+      "learning_rate": 2.394344691382723e-05,
+      "loss": 0.6836,
+      "step": 2240
     },
     {
+      "epoch": 0.6412257057738686,
+      "grad_norm": 0.5544127225875854,
+      "learning_rate": 2.3832386939576214e-05,
+      "loss": 0.681,
+      "step": 2260
     },
     {
+      "epoch": 0.6469002695417789,
+      "grad_norm": 0.5256103277206421,
+      "learning_rate": 2.3720580807568513e-05,
+      "loss": 0.6823,
+      "step": 2280
     },
     {
+      "epoch": 0.6525748333096894,
+      "grad_norm": 0.5488288402557373,
+      "learning_rate": 2.3608037963157142e-05,
+      "loss": 0.6818,
+      "step": 2300
     },
     {
+      "epoch": 0.6582493970775997,
+      "grad_norm": 0.5254908204078674,
+      "learning_rate": 2.3494767913932393e-05,
+      "loss": 0.6774,
+      "step": 2320
     },
     {
+      "epoch": 0.66392396084551,
+      "grad_norm": 0.5880591869354248,
+      "learning_rate": 2.338078022891864e-05,
+      "loss": 0.6795,
+      "step": 2340
     },
     {
+      "epoch": 0.6695985246134204,
+      "grad_norm": 0.5331950783729553,
+      "learning_rate": 2.3266084537765924e-05,
+      "loss": 0.6777,
+      "step": 2360
     },
     {
+      "epoch": 0.6752730883813307,
+      "grad_norm": 0.5736955404281616,
+      "learning_rate": 2.3150690529936475e-05,
+      "loss": 0.6792,
+      "step": 2380
     },
     {
+      "epoch": 0.680947652149241,
+      "grad_norm": 0.5705032348632812,
+      "learning_rate": 2.303460795388613e-05,
+      "loss": 0.6736,
+      "step": 2400
     },
     {
+      "epoch": 0.6866222159171513,
+      "grad_norm": 0.569355845451355,
+      "learning_rate": 2.2917846616240784e-05,
+      "loss": 0.6767,
+      "step": 2420
     },
     {
+      "epoch": 0.6922967796850618,
+      "grad_norm": 1.2819143533706665,
+      "learning_rate": 2.2800416380967952e-05,
+      "loss": 0.6772,
+      "step": 2440
     },
     {
+      "epoch": 0.6979713434529721,
+      "grad_norm": 0.5238373279571533,
+      "learning_rate": 2.268232716854343e-05,
+      "loss": 0.674,
+      "step": 2460
     },
     {
+      "epoch": 0.7036459072208824,
+      "grad_norm": 0.5886688828468323,
+      "learning_rate": 2.2563588955113246e-05,
+      "loss": 0.6757,
+      "step": 2480
     },
     {
+      "epoch": 0.7093204709887927,
+      "grad_norm": 0.5450348854064941,
+      "learning_rate": 2.244421177165085e-05,
+      "loss": 0.6691,
+      "step": 2500
     },
     {
+      "epoch": 0.7149950347567031,
+      "grad_norm": 0.5553733706474304,
+      "learning_rate": 2.232420570310974e-05,
+      "loss": 0.6751,
+      "step": 2520
     },
     {
+      "epoch": 0.7206695985246134,
+      "grad_norm": 0.5076789259910583,
+      "learning_rate": 2.2203580887571423e-05,
+      "loss": 0.6739,
+      "step": 2540
     },
     {
+      "epoch": 0.7263441622925237,
+      "grad_norm": 0.5153952240943909,
+      "learning_rate": 2.2082347515389027e-05,
+      "loss": 0.6734,
+      "step": 2560
     },
     {
+      "epoch": 0.732018726060434,
+      "grad_norm": 0.5176730155944824,
+      "learning_rate": 2.1960515828326372e-05,
+      "loss": 0.6706,
+      "step": 2580
     },
     {
+      "epoch": 0.7376932898283445,
+      "grad_norm": 0.526030421257019,
+      "learning_rate": 2.1838096118692768e-05,
+      "loss": 0.6694,
+      "step": 2600
     },
     {
+      "epoch": 0.7433678535962548,
+      "grad_norm": 0.6030652523040771,
+      "learning_rate": 2.1715098728473518e-05,
+      "loss": 0.6707,
+      "step": 2620
     },
     {
+      "epoch": 0.7490424173641651,
+      "grad_norm": 0.6607082486152649,
+      "learning_rate": 2.1591534048456225e-05,
+      "loss": 0.6668,
+      "step": 2640
     },
     {
+      "epoch": 0.7547169811320755,
+      "grad_norm": 0.5300272107124329,
+      "learning_rate": 2.1467412517352996e-05,
+      "loss": 0.6696,
+      "step": 2660
     },
     {
+      "epoch": 0.7603915448999858,
+      "grad_norm": 0.5344169735908508,
+      "learning_rate": 2.1342744620918568e-05,
+      "loss": 0.6736,
+      "step": 2680
     },
     {
+      "epoch": 0.7660661086678962,
+      "grad_norm": 0.5058417916297913,
+      "learning_rate": 2.121754089106448e-05,
+      "loss": 0.6681,
+      "step": 2700
     },
     {
+      "epoch": 0.7717406724358065,
+      "grad_norm": 0.5440433621406555,
+      "learning_rate": 2.1091811904969344e-05,
+      "loss": 0.6702,
+      "step": 2720
     },
     {
+      "epoch": 0.7774152362037169,
+      "grad_norm": 0.5361486077308655,
+      "learning_rate": 2.096556828418528e-05,
+      "loss": 0.6686,
+      "step": 2740
     },
     {
+      "epoch": 0.7830897999716272,
+      "grad_norm": 0.6350403428077698,
+      "learning_rate": 2.0838820693740603e-05,
+      "loss": 0.6678,
+      "step": 2760
     },
     {
+      "epoch": 0.7887643637395375,
+      "grad_norm": 0.5326098203659058,
+      "learning_rate": 2.0711579841238875e-05,
+      "loss": 0.6711,
+      "step": 2780
     },
     {
+      "epoch": 0.7944389275074478,
+      "grad_norm": 0.540676474571228,
+      "learning_rate": 2.058385647595429e-05,
+      "loss": 0.6705,
+      "step": 2800
     },
     {
+      "epoch": 0.8001134912753582,
+      "grad_norm": 0.4930702745914459,
+      "learning_rate": 2.045566138792361e-05,
+      "loss": 0.6683,
+      "step": 2820
     },
     {
+      "epoch": 0.8057880550432686,
+      "grad_norm": 0.5729920268058777,
+      "learning_rate": 2.032700540703459e-05,
+      "loss": 0.6646,
+      "step": 2840
     },
     {
+      "epoch": 0.8114626188111789,
+      "grad_norm": 0.5179927945137024,
+      "learning_rate": 2.0197899402111127e-05,
+      "loss": 0.6632,
+      "step": 2860
     },
     {
+      "epoch": 0.8171371825790892,
+      "grad_norm": 0.5147942900657654,
+      "learning_rate": 2.0068354279995008e-05,
+      "loss": 0.6558,
+      "step": 2880
     },
     {
+      "epoch": 0.8228117463469996,
+      "grad_norm": 0.5044906735420227,
+      "learning_rate": 1.9938380984624533e-05,
+      "loss": 0.6634,
+      "step": 2900
     },
     {
+      "epoch": 0.8284863101149099,
+      "grad_norm": 0.5231923460960388,
+      "learning_rate": 1.9807990496109965e-05,
+      "loss": 0.6698,
+      "step": 2920
     },
     {
+      "epoch": 0.8341608738828202,
+      "grad_norm": 0.5322957634925842,
+      "learning_rate": 1.967719382980594e-05,
+      "loss": 0.6568,
+      "step": 2940
     },
     {
+      "epoch": 0.8398354376507307,
+      "grad_norm": 0.512269139289856,
+      "learning_rate": 1.9546002035380886e-05,
+      "loss": 0.6654,
+      "step": 2960
     },
     {
+      "epoch": 0.845510001418641,
+      "grad_norm": 0.508976399898529,
+      "learning_rate": 1.9414426195883558e-05,
+      "loss": 0.6552,
+      "step": 2980
     },
     {
+      "epoch": 0.8511845651865513,
+      "grad_norm": 0.5061299204826355,
+      "learning_rate": 1.9282477426806723e-05,
+      "loss": 0.6599,
+      "step": 3000
     },
     {
+      "epoch": 0.8568591289544616,
+      "grad_norm": 0.510822057723999,
+      "learning_rate": 1.9150166875148155e-05,
+      "loss": 0.6612,
+      "step": 3020
     },
     {
+      "epoch": 0.862533692722372,
+      "grad_norm": 0.5578708648681641,
+      "learning_rate": 1.9017505718468934e-05,
+      "loss": 0.658,
+      "step": 3040
     },
     {
+      "epoch": 0.8682082564902823,
+      "grad_norm": 0.5130868554115295,
+      "learning_rate": 1.888450516394914e-05,
+      "loss": 0.6541,
+      "step": 3060
     },
     {
+      "epoch": 0.8738828202581926,
+      "grad_norm": 0.5147811770439148,
+      "learning_rate": 1.8751176447441104e-05,
+      "loss": 0.6586,
+      "step": 3080
     },
     {
+      "epoch": 0.879557384026103,
+      "grad_norm": 0.5556140542030334,
+      "learning_rate": 1.861753083252021e-05,
+      "loss": 0.6535,
+      "step": 3100
     },
     {
+      "epoch": 0.8852319477940134,
+      "grad_norm": 0.509611964225769,
+      "learning_rate": 1.8483579609533318e-05,
+      "loss": 0.6537,
+      "step": 3120
     },
     {
+      "epoch": 0.8909065115619237,
+      "grad_norm": 0.5088684558868408,
+      "learning_rate": 1.834933409464499e-05,
+      "loss": 0.6562,
+      "step": 3140
     },
     {
+      "epoch": 0.896581075329834,
+      "grad_norm": 0.48405396938323975,
+      "learning_rate": 1.821480562888148e-05,
+      "loss": 0.6583,
+      "step": 3160
     },
     {
+      "epoch": 0.9022556390977443,
+      "grad_norm": 0.5087782144546509,
+      "learning_rate": 1.808000557717268e-05,
+      "loss": 0.6558,
+      "step": 3180
     },
     {
+      "epoch": 0.9079302028656547,
+      "grad_norm": 0.5303909778594971,
+      "learning_rate": 1.7944945327391957e-05,
+      "loss": 0.6517,
+      "step": 3200
     },
     {
+      "epoch": 0.913604766633565,
+      "grad_norm": 0.5164442658424377,
+      "learning_rate": 1.7809636289394185e-05,
+      "loss": 0.6529,
+      "step": 3220
     },
     {
+      "epoch": 0.9192793304014754,
+      "grad_norm": 0.5162308216094971,
+      "learning_rate": 1.7674089894051774e-05,
+      "loss": 0.6542,
+      "step": 3240
     },
     {
+      "epoch": 0.9249538941693858,
+      "grad_norm": 0.545396625995636,
+      "learning_rate": 1.753831759228903e-05,
+      "loss": 0.6527,
+      "step": 3260
     },
     {
+      "epoch": 0.9306284579372961,
+      "grad_norm": 0.5134595632553101,
+      "learning_rate": 1.740233085411477e-05,
+      "loss": 0.6555,
+      "step": 3280
     },
     {
+      "epoch": 0.9363030217052064,
+      "grad_norm": 0.48815637826919556,
+      "learning_rate": 1.7266141167653353e-05,
+      "loss": 0.6554,
+      "step": 3300
     },
     {
+      "epoch": 0.9419775854731167,
+      "grad_norm": 0.5034410953521729,
+      "learning_rate": 1.7129760038174146e-05,
+      "loss": 0.6514,
+      "step": 3320
     },
     {
+      "epoch": 0.9476521492410271,
+      "grad_norm": 0.5322323441505432,
+      "learning_rate": 1.6993198987119576e-05,
+      "loss": 0.6533,
+      "step": 3340
     },
     {
+      "epoch": 0.9533267130089375,
+      "grad_norm": 0.48363253474235535,
+      "learning_rate": 1.6856469551131805e-05,
+      "loss": 0.6468,
+      "step": 3360
     },
     {
+      "epoch": 0.9590012767768478,
+      "grad_norm": 0.4600164592266083,
+      "learning_rate": 1.67195832810781e-05,
+      "loss": 0.6472,
+      "step": 3380
     },
     {
+      "epoch": 0.9646758405447581,
+      "grad_norm": 0.49600768089294434,
+      "learning_rate": 1.6582551741075033e-05,
+      "loss": 0.6467,
+      "step": 3400
     },
     {
+      "epoch": 0.9703504043126685,
+      "grad_norm": 0.7202423810958862,
+      "learning_rate": 1.6445386507511546e-05,
+      "loss": 0.6502,
+      "step": 3420
     },
     {
+      "epoch": 0.9760249680805788,
+      "grad_norm": 0.502703070640564,
+      "learning_rate": 1.630809916807098e-05,
+      "loss": 0.6424,
+      "step": 3440
     },
     {
+      "epoch": 0.9816995318484891,
+      "grad_norm": 0.49266818165779114,
+      "learning_rate": 1.617070132075214e-05,
+      "loss": 0.6485,
+      "step": 3460
     },
     {
+      "epoch": 0.9873740956163994,
+      "grad_norm": 0.5194821357727051,
+      "learning_rate": 1.6033204572889516e-05,
+      "loss": 0.6499,
+      "step": 3480
     },
     {
+      "epoch": 0.9930486593843099,
+      "grad_norm": 0.49109163880348206,
+      "learning_rate": 1.5895620540172682e-05,
+      "loss": 0.6506,
+      "step": 3500
     },
     {
+      "epoch": 0.9987232231522202,
+      "grad_norm": 0.5099320411682129,
+      "learning_rate": 1.575796084566503e-05,
+      "loss": 0.6466,
+      "step": 3520
     },
     {
+      "epoch": 1.0043977869201306,
+      "grad_norm": 0.5476223230361938,
+      "learning_rate": 1.562023711882182e-05,
+      "loss": 0.5924,
+      "step": 3540
     },
     {
+      "epoch": 1.010072350688041,
+      "grad_norm": 0.4934983551502228,
+      "learning_rate": 1.548246099450776e-05,
+      "loss": 0.5683,
+      "step": 3560
     },
     {
+      "epoch": 1.0157469144559512,
+      "grad_norm": 0.5262681841850281,
+      "learning_rate": 1.534464411201409e-05,
+      "loss": 0.5733,
+      "step": 3580
     },
     {
+      "epoch": 1.0214214782238615,
+      "grad_norm": 0.5271425843238831,
+      "learning_rate": 1.520679811407526e-05,
+      "loss": 0.5697,
+      "step": 3600
+    },
+    {
+      "epoch": 1.0270960419917718,
+      "grad_norm": 0.5124356150627136,
+      "learning_rate": 1.506893464588542e-05,
+      "loss": 0.5653,
+      "step": 3620
+    },
+    {
+      "epoch": 1.0327706057596822,
+      "grad_norm": 0.5131009817123413,
+      "learning_rate": 1.4931065354114584e-05,
+      "loss": 0.5669,
+      "step": 3640
+    },
+    {
+      "epoch": 1.0384451695275925,
+      "grad_norm": 0.5003370046615601,
+      "learning_rate": 1.4793201885924745e-05,
+      "loss": 0.565,
+      "step": 3660
+    },
+    {
+      "epoch": 1.044119733295503,
+      "grad_norm": 0.5440374612808228,
+      "learning_rate": 1.465535588798592e-05,
+      "loss": 0.5708,
+      "step": 3680
+    },
+    {
+      "epoch": 1.0497942970634133,
+      "grad_norm": 0.5212259292602539,
+      "learning_rate": 1.4517539005492237e-05,
+      "loss": 0.57,
+      "step": 3700
+    },
+    {
+      "epoch": 1.0554688608313236,
+      "grad_norm": 0.5004721879959106,
+      "learning_rate": 1.4379762881178182e-05,
+      "loss": 0.5692,
+      "step": 3720
+    },
+    {
+      "epoch": 1.061143424599234,
+      "grad_norm": 0.5253936648368835,
+      "learning_rate": 1.4242039154334973e-05,
+      "loss": 0.5685,
+      "step": 3740
+    },
+    {
+      "epoch": 1.0668179883671443,
+      "grad_norm": 0.5163034200668335,
+      "learning_rate": 1.410437945982732e-05,
+      "loss": 0.5706,
+      "step": 3760
+    },
+    {
+      "epoch": 1.0724925521350546,
+      "grad_norm": 0.49630168080329895,
+      "learning_rate": 1.3966795427110493e-05,
+      "loss": 0.5725,
+      "step": 3780
+    },
+    {
+      "epoch": 1.0781671159029649,
+      "grad_norm": 0.5117852091789246,
+      "learning_rate": 1.3829298679247865e-05,
+      "loss": 0.5646,
+      "step": 3800
+    },
+    {
+      "epoch": 1.0838416796708752,
+      "grad_norm": 0.5082918405532837,
+      "learning_rate": 1.369190083192902e-05,
+      "loss": 0.5705,
+      "step": 3820
+    },
+    {
+      "epoch": 1.0895162434387857,
+      "grad_norm": 0.5319990515708923,
+      "learning_rate": 1.3554613492488453e-05,
+      "loss": 0.5684,
+      "step": 3840
+    },
+    {
+      "epoch": 1.095190807206696,
+      "grad_norm": 0.5344195365905762,
+      "learning_rate": 1.3417448258924971e-05,
+      "loss": 0.5658,
+      "step": 3860
+    },
+    {
+      "epoch": 1.1008653709746063,
+      "grad_norm": 0.507433295249939,
+      "learning_rate": 1.3280416718921902e-05,
+      "loss": 0.5717,
+      "step": 3880
+    },
+    {
+      "epoch": 1.1065399347425167,
+      "grad_norm": 0.5090216398239136,
+      "learning_rate": 1.3143530448868198e-05,
+      "loss": 0.5663,
+      "step": 3900
+    },
+    {
+      "epoch": 1.112214498510427,
+      "grad_norm": 0.512146532535553,
+      "learning_rate": 1.3006801012880425e-05,
+      "loss": 0.5656,
+      "step": 3920
+    },
+    {
+      "epoch": 1.1178890622783373,
+      "grad_norm": 0.5273200869560242,
+      "learning_rate": 1.2870239961825853e-05,
+      "loss": 0.5621,
+      "step": 3940
+    },
+    {
+      "epoch": 1.1235636260462476,
+      "grad_norm": 0.5408139824867249,
+      "learning_rate": 1.2733858832346648e-05,
+      "loss": 0.5744,
+      "step": 3960
+    },
+    {
+      "epoch": 1.1292381898141581,
+      "grad_norm": 0.4986436069011688,
+      "learning_rate": 1.2597669145885231e-05,
+      "loss": 0.5704,
+      "step": 3980
+    },
+    {
+      "epoch": 1.1349127535820684,
+      "grad_norm": 0.5186699628829956,
+      "learning_rate": 1.2461682407710973e-05,
+      "loss": 0.5588,
+      "step": 4000
+    },
+    {
+      "epoch": 1.1405873173499788,
+      "grad_norm": 0.5081115365028381,
+      "learning_rate": 1.2325910105948229e-05,
+      "loss": 0.5667,
+      "step": 4020
+    },
+    {
+      "epoch": 1.146261881117889,
+      "grad_norm": 0.501616358757019,
+      "learning_rate": 1.219036371060582e-05,
+      "loss": 0.5628,
+      "step": 4040
+    },
+    {
+      "epoch": 1.1519364448857994,
+      "grad_norm": 0.5288362503051758,
+      "learning_rate": 1.2055054672608043e-05,
+      "loss": 0.5642,
+      "step": 4060
+    },
+    {
+      "epoch": 1.1576110086537097,
+      "grad_norm": 0.5392152070999146,
+      "learning_rate": 1.1919994422827326e-05,
+      "loss": 0.5606,
+      "step": 4080
+    },
+    {
+      "epoch": 1.16328557242162,
+      "grad_norm": 0.514348030090332,
+      "learning_rate": 1.1785194371118521e-05,
+      "loss": 0.5653,
+      "step": 4100
+    },
+    {
+      "epoch": 1.1689601361895305,
+      "grad_norm": 0.4942004978656769,
+      "learning_rate": 1.1650665905355014e-05,
+      "loss": 0.5622,
+      "step": 4120
+    },
+    {
+      "epoch": 1.1746346999574409,
+      "grad_norm": 0.48802751302719116,
+      "learning_rate": 1.1516420390466685e-05,
+      "loss": 0.5613,
+      "step": 4140
+    },
+    {
+      "epoch": 1.1803092637253512,
+      "grad_norm": 0.5025625228881836,
+      "learning_rate": 1.1382469167479795e-05,
+      "loss": 0.5656,
+      "step": 4160
+    },
+    {
+      "epoch": 1.1859838274932615,
+      "grad_norm": 0.5276467204093933,
+      "learning_rate": 1.1248823552558895e-05,
+      "loss": 0.5639,
+      "step": 4180
+    },
+    {
+      "epoch": 1.1916583912611718,
+      "grad_norm": 0.5035718083381653,
+      "learning_rate": 1.1115494836050861e-05,
+      "loss": 0.5612,
+      "step": 4200
+    },
+    {
+      "epoch": 1.197332955029082,
+      "grad_norm": 0.5080997347831726,
+      "learning_rate": 1.0982494281531069e-05,
+      "loss": 0.5647,
+      "step": 4220
+    },
+    {
+      "epoch": 1.2030075187969924,
+      "grad_norm": 0.505695104598999,
+      "learning_rate": 1.0849833124851846e-05,
+      "loss": 0.5681,
+      "step": 4240
+    },
+    {
+      "epoch": 1.2086820825649027,
+      "grad_norm": 0.48905614018440247,
+      "learning_rate": 1.0717522573193281e-05,
+      "loss": 0.561,
+      "step": 4260
+    },
+    {
+      "epoch": 1.2143566463328133,
+      "grad_norm": 0.49127668142318726,
+      "learning_rate": 1.0585573804116448e-05,
+      "loss": 0.5639,
+      "step": 4280
+    },
+    {
+      "epoch": 1.2200312101007236,
+      "grad_norm": 0.5206524729728699,
+      "learning_rate": 1.0453997964619112e-05,
+      "loss": 0.5594,
+      "step": 4300
+    },
+    {
+      "epoch": 1.2257057738686339,
+      "grad_norm": 0.48683062195777893,
+      "learning_rate": 1.0322806170194061e-05,
+      "loss": 0.5622,
+      "step": 4320
+    },
+    {
+      "epoch": 1.2313803376365442,
+      "grad_norm": 0.532207190990448,
+      "learning_rate": 1.0192009503890037e-05,
+      "loss": 0.5581,
+      "step": 4340
+    },
+    {
+      "epoch": 1.2370549014044545,
+      "grad_norm": 0.49200239777565,
+      "learning_rate": 1.0061619015375473e-05,
+      "loss": 0.5594,
+      "step": 4360
+    },
+    {
+      "epoch": 1.2427294651723648,
+      "grad_norm": 0.504898190498352,
+      "learning_rate": 9.931645720004995e-06,
+      "loss": 0.5622,
+      "step": 4380
+    },
+    {
+      "epoch": 1.2484040289402751,
+      "grad_norm": 0.5061923861503601,
+      "learning_rate": 9.802100597888877e-06,
+      "loss": 0.5572,
+      "step": 4400
+    },
+    {
+      "epoch": 1.2540785927081854,
+      "grad_norm": 0.4961055815219879,
+      "learning_rate": 9.672994592965409e-06,
+      "loss": 0.5609,
+      "step": 4420
+    },
+    {
+      "epoch": 1.259753156476096,
+      "grad_norm": 0.4930592477321625,
+      "learning_rate": 9.544338612076396e-06,
+      "loss": 0.5637,
+      "step": 4440
+    },
+    {
+      "epoch": 1.2654277202440063,
+      "grad_norm": 0.4978179335594177,
+      "learning_rate": 9.41614352404571e-06,
+      "loss": 0.5615,
+      "step": 4460
+    },
+    {
+      "epoch": 1.2711022840119166,
+      "grad_norm": 0.5112114548683167,
+      "learning_rate": 9.288420158761127e-06,
+      "loss": 0.558,
+      "step": 4480
+    },
+    {
+      "epoch": 1.276776847779827,
+      "grad_norm": 0.5114573240280151,
+      "learning_rate": 9.161179306259401e-06,
+      "loss": 0.5561,
+      "step": 4500
+    },
+    {
+      "epoch": 1.2824514115477372,
+      "grad_norm": 0.5023430585861206,
+      "learning_rate": 9.034431715814726e-06,
+      "loss": 0.5558,
+      "step": 4520
+    },
+    {
+      "epoch": 1.2881259753156475,
+      "grad_norm": 0.503487765789032,
+      "learning_rate": 8.908188095030655e-06,
+      "loss": 0.5607,
+      "step": 4540
+    },
+    {
+      "epoch": 1.2938005390835579,
+      "grad_norm": 0.5188455581665039,
+      "learning_rate": 8.78245910893552e-06,
+      "loss": 0.5639,
+      "step": 4560
+    },
+    {
+      "epoch": 1.2994751028514684,
+      "grad_norm": 0.5216081738471985,
+      "learning_rate": 8.657255379081438e-06,
+      "loss": 0.5584,
+      "step": 4580
+    },
+    {
+      "epoch": 1.3051496666193787,
+      "grad_norm": 0.5024508833885193,
+      "learning_rate": 8.532587482647013e-06,
+      "loss": 0.5604,
+      "step": 4600
+    },
+    {
+      "epoch": 1.310824230387289,
+      "grad_norm": 0.5100445747375488,
+      "learning_rate": 8.408465951543779e-06,
+      "loss": 0.5596,
+      "step": 4620
+    },
+    {
+      "epoch": 1.3164987941551993,
+      "grad_norm": 0.5005710124969482,
+      "learning_rate": 8.284901271526481e-06,
+      "loss": 0.5591,
+      "step": 4640
+    },
+    {
+      "epoch": 1.3221733579231096,
+      "grad_norm": 0.5151055455207825,
+      "learning_rate": 8.161903881307231e-06,
+      "loss": 0.5462,
+      "step": 4660
+    },
+    {
+      "epoch": 1.32784792169102,
+      "grad_norm": 0.4919968545436859,
+      "learning_rate": 8.039484171673628e-06,
+      "loss": 0.5523,
+      "step": 4680
+    },
+    {
+      "epoch": 1.3335224854589303,
+      "grad_norm": 0.5007758140563965,
+      "learning_rate": 7.917652484610975e-06,
+      "loss": 0.5545,
+      "step": 4700
+    },
+    {
+      "epoch": 1.3391970492268408,
+      "grad_norm": 0.4885912537574768,
+      "learning_rate": 7.796419112428583e-06,
+      "loss": 0.5582,
+      "step": 4720
+    },
+    {
+      "epoch": 1.344871612994751,
+      "grad_norm": 0.4874049127101898,
+      "learning_rate": 7.675794296890265e-06,
+      "loss": 0.5505,
+      "step": 4740
+    },
+    {
+      "epoch": 1.3505461767626614,
+      "grad_norm": 0.46998655796051025,
+      "learning_rate": 7.555788228349143e-06,
+      "loss": 0.554,
+      "step": 4760
+    },
+    {
+      "epoch": 1.3562207405305717,
+      "grad_norm": 0.4996753931045532,
+      "learning_rate": 7.436411044886753e-06,
+      "loss": 0.5513,
+      "step": 4780
+    },
+    {
+      "epoch": 1.361895304298482,
+      "grad_norm": 0.502571165561676,
+      "learning_rate": 7.31767283145657e-06,
+      "loss": 0.5547,
+      "step": 4800
+    },
+    {
+      "epoch": 1.3675698680663924,
+      "grad_norm": 0.48792627453804016,
+      "learning_rate": 7.199583619032052e-06,
+      "loss": 0.5551,
+      "step": 4820
+    },
+    {
+      "epoch": 1.3732444318343027,
+      "grad_norm": 0.48799988627433777,
+      "learning_rate": 7.082153383759222e-06,
+      "loss": 0.5524,
+      "step": 4840
+    },
+    {
+      "epoch": 1.3789189956022132,
+      "grad_norm": 0.4976406991481781,
+      "learning_rate": 6.9653920461138755e-06,
+      "loss": 0.5548,
+      "step": 4860
+    },
+    {
+      "epoch": 1.3845935593701233,
+      "grad_norm": 0.5006715655326843,
+      "learning_rate": 6.849309470063529e-06,
+      "loss": 0.5544,
+      "step": 4880
+    },
+    {
+      "epoch": 1.3902681231380338,
+      "grad_norm": 0.4864628314971924,
+      "learning_rate": 6.7339154622340754e-06,
+      "loss": 0.5483,
+      "step": 4900
+    },
+    {
+      "epoch": 1.3959426869059441,
+      "grad_norm": 0.48580724000930786,
+      "learning_rate": 6.619219771081361e-06,
+      "loss": 0.5544,
+      "step": 4920
+    },
+    {
+      "epoch": 1.4016172506738545,
+      "grad_norm": 0.5042415857315063,
+      "learning_rate": 6.505232086067607e-06,
+      "loss": 0.5504,
+      "step": 4940
+    },
+    {
+      "epoch": 1.4072918144417648,
+      "grad_norm": 0.4970082640647888,
+      "learning_rate": 6.391962036842863e-06,
+      "loss": 0.547,
+      "step": 4960
+    },
+    {
+      "epoch": 1.412966378209675,
+      "grad_norm": 0.47866857051849365,
+      "learning_rate": 6.279419192431494e-06,
+      "loss": 0.5548,
+      "step": 4980
+    },
+    {
+      "epoch": 1.4186409419775854,
+      "grad_norm": 0.4664076566696167,
+      "learning_rate": 6.167613060423789e-06,
+      "loss": 0.5454,
+      "step": 5000
+    },
+    {
+      "epoch": 1.4243155057454957,
+      "grad_norm": 0.49711087346076965,
+      "learning_rate": 6.0565530861727685e-06,
+      "loss": 0.5519,
+      "step": 5020
+    },
+    {
+      "epoch": 1.4299900695134062,
+      "grad_norm": 0.46965324878692627,
+      "learning_rate": 5.946248651996244e-06,
+      "loss": 0.5519,
+      "step": 5040
+    },
+    {
+      "epoch": 1.4356646332813165,
+      "grad_norm": 0.505743145942688,
+      "learning_rate": 5.836709076384188e-06,
+      "loss": 0.5482,
+      "step": 5060
+    },
+    {
+      "epoch": 1.4413391970492269,
+      "grad_norm": 0.5078002214431763,
+      "learning_rate": 5.727943613211521e-06,
+      "loss": 0.5575,
+      "step": 5080
+    },
+    {
+      "epoch": 1.4470137608171372,
+      "grad_norm": 0.48647207021713257,
+      "learning_rate": 5.619961450956347e-06,
+      "loss": 0.5461,
+      "step": 5100
+    },
+    {
+      "epoch": 1.4526883245850475,
+      "grad_norm": 0.4711668789386749,
+      "learning_rate": 5.5127717119237084e-06,
+      "loss": 0.5472,
+      "step": 5120
+    },
+    {
+      "epoch": 1.4583628883529578,
+      "grad_norm": 0.518395721912384,
+      "learning_rate": 5.406383451474948e-06,
+      "loss": 0.5483,
+      "step": 5140
+    },
+    {
+      "epoch": 1.464037452120868,
+      "grad_norm": 0.4849320948123932,
+      "learning_rate": 5.300805657262706e-06,
+      "loss": 0.5459,
+      "step": 5160
+    },
+    {
+      "epoch": 1.4697120158887786,
+      "grad_norm": 0.501943826675415,
+      "learning_rate": 5.1960472484716374e-06,
+      "loss": 0.5482,
+      "step": 5180
+    },
+    {
+      "epoch": 1.475386579656689,
+      "grad_norm": 0.48699691891670227,
+      "learning_rate": 5.092117075064931e-06,
+      "loss": 0.5522,
+      "step": 5200
+    },
+    {
+      "epoch": 1.4810611434245993,
+      "grad_norm": 0.48894861340522766,
+      "learning_rate": 4.989023917036667e-06,
+      "loss": 0.5502,
+      "step": 5220
+    },
+    {
+      "epoch": 1.4867357071925096,
+      "grad_norm": 0.49131521582603455,
+      "learning_rate": 4.886776483670077e-06,
+      "loss": 0.5466,
+      "step": 5240
+    },
+    {
+      "epoch": 1.49241027096042,
+      "grad_norm": 0.47139400243759155,
+      "learning_rate": 4.78538341280181e-06,
+      "loss": 0.5473,
+      "step": 5260
+    },
+    {
+      "epoch": 1.4980848347283302,
+      "grad_norm": 0.49604731798171997,
+      "learning_rate": 4.684853270092173e-06,
+      "loss": 0.5498,
+      "step": 5280
+    },
+    {
+      "epoch": 1.5037593984962405,
+      "grad_norm": 0.4864351749420166,
+      "learning_rate": 4.585194548301545e-06,
+      "loss": 0.5448,
+      "step": 5300
+    },
+    {
+      "epoch": 1.509433962264151,
+      "grad_norm": 0.48130905628204346,
+      "learning_rate": 4.486415666572874e-06,
+      "loss": 0.5469,
+      "step": 5320
+    },
+    {
+      "epoch": 1.5151085260320611,
+      "grad_norm": 0.4783124625682831,
+      "learning_rate": 4.388524969720458e-06,
+      "loss": 0.546,
+      "step": 5340
+    },
+    {
+      "epoch": 1.5207830897999717,
+      "grad_norm": 0.4969868063926697,
+      "learning_rate": 4.2915307275249585e-06,
+      "loss": 0.5453,
+      "step": 5360
+    },
+    {
+      "epoch": 1.526457653567882,
+      "grad_norm": 0.4832542836666107,
+      "learning_rate": 4.195441134034799e-06,
+      "loss": 0.5463,
+      "step": 5380
+    },
+    {
+      "epoch": 1.5321322173357923,
+      "grad_norm": 0.4712090790271759,
+      "learning_rate": 4.10026430687389e-06,
+      "loss": 0.5449,
+      "step": 5400
+    },
+    {
+      "epoch": 1.5378067811037026,
+      "grad_norm": 0.4822421967983246,
+      "learning_rate": 4.0060082865559035e-06,
+      "loss": 0.5465,
+      "step": 5420
+    },
+    {
+      "epoch": 1.543481344871613,
+      "grad_norm": 0.4809670150279999,
+      "learning_rate": 3.912681035804971e-06,
+      "loss": 0.5406,
+      "step": 5440
+    },
+    {
+      "epoch": 1.5491559086395235,
+      "grad_norm": 0.4631410539150238,
+      "learning_rate": 3.820290438883018e-06,
+      "loss": 0.5461,
+      "step": 5460
+    },
+    {
+      "epoch": 1.5548304724074336,
+      "grad_norm": 0.46498140692710876,
+      "learning_rate": 3.728844300923694e-06,
+      "loss": 0.5419,
+      "step": 5480
+    },
+    {
+      "epoch": 1.560505036175344,
+      "grad_norm": 0.4786704480648041,
+      "learning_rate": 3.6383503472730116e-06,
+      "loss": 0.5476,
+      "step": 5500
+    },
+    {
+      "epoch": 1.5661795999432544,
+      "grad_norm": 0.4655323624610901,
+      "learning_rate": 3.548816222836688e-06,
+      "loss": 0.5406,
+      "step": 5520
+    },
+    {
+      "epoch": 1.5718541637111647,
+      "grad_norm": 0.46424925327301025,
+      "learning_rate": 3.460249491434319e-06,
+      "loss": 0.5415,
+      "step": 5540
+    },
+    {
+      "epoch": 1.577528727479075,
+      "grad_norm": 0.45783787965774536,
+      "learning_rate": 3.3726576351603985e-06,
+      "loss": 0.5503,
+      "step": 5560
+    },
+    {
+      "epoch": 1.5832032912469853,
+      "grad_norm": 0.49086692929267883,
+      "learning_rate": 3.2860480537522103e-06,
+      "loss": 0.543,
+      "step": 5580
+    },
+    {
+      "epoch": 1.5888778550148959,
+      "grad_norm": 0.48474520444869995,
+      "learning_rate": 3.2004280639647122e-06,
+      "loss": 0.539,
+      "step": 5600
+    },
+    {
+      "epoch": 1.594552418782806,
+      "grad_norm": 0.5037649869918823,
+      "learning_rate": 3.115804898952434e-06,
+      "loss": 0.5415,
+      "step": 5620
+    },
+    {
+      "epoch": 1.6002269825507165,
+      "grad_norm": 0.4954313337802887,
+      "learning_rate": 3.032185707658389e-06,
+      "loss": 0.5487,
+      "step": 5640
+    },
+    {
+      "epoch": 1.6059015463186268,
+      "grad_norm": 0.4597771465778351,
+      "learning_rate": 2.949577554210157e-06,
+      "loss": 0.5445,
+      "step": 5660
+    },
+    {
+      "epoch": 1.6115761100865371,
+      "grad_norm": 0.4839852750301361,
+      "learning_rate": 2.8679874173231137e-06,
+      "loss": 0.5499,
+      "step": 5680
+    },
+    {
+      "epoch": 1.6172506738544474,
+      "grad_norm": 0.4653310179710388,
+      "learning_rate": 2.787422189710844e-06,
+      "loss": 0.5453,
+      "step": 5700
+    },
+    {
+      "epoch": 1.6229252376223577,
+      "grad_norm": 0.485579252243042,
+      "learning_rate": 2.7078886775028693e-06,
+      "loss": 0.5383,
+      "step": 5720
+    },
+    {
+      "epoch": 1.6285998013902683,
+      "grad_norm": 0.4727838337421417,
+      "learning_rate": 2.629393599669667e-06,
+      "loss": 0.5421,
+      "step": 5740
+    },
+    {
+      "epoch": 1.6342743651581784,
+      "grad_norm": 0.45239365100860596,
+      "learning_rate": 2.5519435874550434e-06,
+      "loss": 0.5357,
+      "step": 5760
+    },
+    {
+      "epoch": 1.639948928926089,
+      "grad_norm": 0.4669874310493469,
+      "learning_rate": 2.475545183815926e-06,
+      "loss": 0.5385,
+      "step": 5780
+    },
+    {
+      "epoch": 1.645623492693999,
+      "grad_norm": 0.4859563410282135,
+      "learning_rate": 2.400204842869637e-06,
+      "loss": 0.5446,
+      "step": 5800
+    },
+    {
+      "epoch": 1.6512980564619095,
+      "grad_norm": 0.4492729902267456,
+      "learning_rate": 2.3259289293486246e-06,
+      "loss": 0.5418,
+      "step": 5820
+    },
+    {
+      "epoch": 1.6569726202298198,
+      "grad_norm": 0.46383896470069885,
+      "learning_rate": 2.252723718062787e-06,
+      "loss": 0.5401,
+      "step": 5840
+    },
+    {
+      "epoch": 1.6626471839977301,
+      "grad_norm": 0.48168492317199707,
+      "learning_rate": 2.1805953933693835e-06,
+      "loss": 0.5423,
+      "step": 5860
+    },
+    {
+      "epoch": 1.6683217477656405,
+      "grad_norm": 0.46742239594459534,
+      "learning_rate": 2.109550048650563e-06,
+      "loss": 0.542,
+      "step": 5880
+    },
+    {
+      "epoch": 1.6739963115335508,
+      "grad_norm": 0.46751725673675537,
+      "learning_rate": 2.0395936857986125e-06,
+      "loss": 0.5402,
+      "step": 5900
+    },
+    {
+      "epoch": 1.6796708753014613,
+      "grad_norm": 0.49627310037612915,
+      "learning_rate": 1.970732214708908e-06,
+      "loss": 0.5461,
+      "step": 5920
+    },
+    {
+      "epoch": 1.6853454390693714,
+      "grad_norm": 0.46826520562171936,
+      "learning_rate": 1.9029714527806652e-06,
+      "loss": 0.5385,
+      "step": 5940
+    },
+    {
+      "epoch": 1.691020002837282,
+      "grad_norm": 0.4701858162879944,
+      "learning_rate": 1.8363171244254606e-06,
+      "loss": 0.5376,
+      "step": 5960
+    },
+    {
+      "epoch": 1.6966945666051922,
+      "grad_norm": 0.4635229706764221,
+      "learning_rate": 1.7707748605836632e-06,
+      "loss": 0.5378,
+      "step": 5980
+    },
+    {
+      "epoch": 1.7023691303731026,
+      "grad_norm": 0.4729613661766052,
+      "learning_rate": 1.7063501982487135e-06,
+      "loss": 0.5437,
+      "step": 6000
+    },
+    {
+      "epoch": 1.7080436941410129,
+      "grad_norm": 0.4672451913356781,
+      "learning_rate": 1.6430485799993673e-06,
+      "loss": 0.5428,
+      "step": 6020
+    },
+    {
+      "epoch": 1.7137182579089232,
+      "grad_norm": 0.46772390604019165,
+      "learning_rate": 1.5808753535399022e-06,
+      "loss": 0.5392,
+      "step": 6040
+    },
+    {
+      "epoch": 1.7193928216768337,
+      "grad_norm": 0.46337825059890747,
+      "learning_rate": 1.5198357712483629e-06,
+      "loss": 0.5413,
+      "step": 6060
+    },
+    {
+      "epoch": 1.7250673854447438,
+      "grad_norm": 0.48103076219558716,
+      "learning_rate": 1.459934989732818e-06,
+      "loss": 0.5416,
+      "step": 6080
+    },
+    {
+      "epoch": 1.7307419492126543,
+      "grad_norm": 0.45769959688186646,
+      "learning_rate": 1.4011780693957492e-06,
+      "loss": 0.5436,
+      "step": 6100
+    },
+    {
+      "epoch": 1.7364165129805647,
+      "grad_norm": 0.4552821218967438,
+      "learning_rate": 1.3435699740065377e-06,
+      "loss": 0.5425,
+      "step": 6120
+    },
+    {
+      "epoch": 1.742091076748475,
+      "grad_norm": 0.48623600602149963,
+      "learning_rate": 1.2871155702821324e-06,
+      "loss": 0.5427,
+      "step": 6140
+    },
+    {
+      "epoch": 1.7477656405163853,
+      "grad_norm": 0.5024483799934387,
+      "learning_rate": 1.231819627475911e-06,
+      "loss": 0.5384,
+      "step": 6160
+    },
+    {
+      "epoch": 1.7534402042842956,
+      "grad_norm": 0.4556623101234436,
+      "learning_rate": 1.1776868169747702e-06,
+      "loss": 0.5393,
+      "step": 6180
+    },
+    {
+      "epoch": 1.7591147680522061,
+      "grad_norm": 0.4748471677303314,
+      "learning_rate": 1.1247217119044951e-06,
+      "loss": 0.5385,
+      "step": 6200
+    },
+    {
+      "epoch": 1.7647893318201162,
+      "grad_norm": 0.4622340500354767,
+      "learning_rate": 1.07292878674342e-06,
+      "loss": 0.5377,
+      "step": 6220
+    },
+    {
+      "epoch": 1.7704638955880267,
+      "grad_norm": 0.4581329822540283,
+      "learning_rate": 1.0223124169444236e-06,
+      "loss": 0.5366,
+      "step": 6240
+    },
+    {
+      "epoch": 1.776138459355937,
+      "grad_norm": 0.4667391777038574,
+      "learning_rate": 9.72876878565287e-07,
+      "loss": 0.539,
+      "step": 6260
+    },
+    {
+      "epoch": 1.7818130231238474,
+      "grad_norm": 0.4563803970813751,
+      "learning_rate": 9.246263479074663e-07,
+      "loss": 0.5403,
+      "step": 6280
+    },
+    {
+      "epoch": 1.7874875868917577,
+      "grad_norm": 0.44948819279670715,
+      "learning_rate": 8.775649011632703e-07,
+      "loss": 0.5392,
+      "step": 6300
+    },
+    {
+      "epoch": 1.793162150659668,
+      "grad_norm": 0.4829549193382263,
+      "learning_rate": 8.316965140715071e-07,
+      "loss": 0.5373,
+      "step": 6320
+    },
+    {
+      "epoch": 1.7988367144275785,
+      "grad_norm": 0.4718981683254242,
+      "learning_rate": 7.870250615816182e-07,
+      "loss": 0.5383,
+      "step": 6340
+    },
+    {
+      "epoch": 1.8045112781954886,
+      "grad_norm": 0.4641667306423187,
+      "learning_rate": 7.435543175263166e-07,
+      "loss": 0.543,
+      "step": 6360
+    },
+    {
+      "epoch": 1.8101858419633992,
+      "grad_norm": 0.45884087681770325,
+      "learning_rate": 7.012879543027801e-07,
+      "loss": 0.538,
+      "step": 6380
+    },
+    {
+      "epoch": 1.8158604057313092,
+      "grad_norm": 0.4888609051704407,
+      "learning_rate": 6.602295425624033e-07,
+      "loss": 0.5366,
+      "step": 6400
+    },
+    {
+      "epoch": 1.8215349694992198,
+      "grad_norm": 0.46243107318878174,
+      "learning_rate": 6.20382550909157e-07,
+      "loss": 0.5365,
+      "step": 6420
+    },
+    {
+      "epoch": 1.82720953326713,
+      "grad_norm": 0.46520647406578064,
+      "learning_rate": 5.817503456065559e-07,
+      "loss": 0.5339,
+      "step": 6440
+    },
+    {
+      "epoch": 1.8328840970350404,
+      "grad_norm": 0.47549664974212646,
+      "learning_rate": 5.443361902932792e-07,
+      "loss": 0.5361,
+      "step": 6460
+    },
+    {
+      "epoch": 1.838558660802951,
+      "grad_norm": 0.4677965044975281,
+      "learning_rate": 5.081432457074614e-07,
+      "loss": 0.5394,
+      "step": 6480
+    },
+    {
+      "epoch": 1.844233224570861,
+      "grad_norm": 0.46250638365745544,
+      "learning_rate": 4.7317456941966597e-07,
+      "loss": 0.5388,
+      "step": 6500
+    },
+    {
+      "epoch": 1.8499077883387716,
+      "grad_norm": 0.4758864641189575,
+      "learning_rate": 4.3943311557459177e-07,
+      "loss": 0.534,
+      "step": 6520
+    },
+    {
+      "epoch": 1.8555823521066817,
+      "grad_norm": 0.4370381832122803,
+      "learning_rate": 4.069217346415027e-07,
+      "loss": 0.5339,
+      "step": 6540
+    },
+    {
+      "epoch": 1.8612569158745922,
+      "grad_norm": 0.4617324769496918,
+      "learning_rate": 3.756431731734272e-07,
+      "loss": 0.5396,
+      "step": 6560
+    },
+    {
+      "epoch": 1.8669314796425025,
+      "grad_norm": 0.4532717168331146,
+      "learning_rate": 3.4560007357511856e-07,
+      "loss": 0.5393,
+      "step": 6580
+    },
+    {
+      "epoch": 1.8726060434104128,
+      "grad_norm": 0.46486184000968933,
+      "learning_rate": 3.16794973879837e-07,
+      "loss": 0.5367,
+      "step": 6600
+    },
+    {
+      "epoch": 1.8782806071783231,
+      "grad_norm": 0.44514200091362,
+      "learning_rate": 2.8923030753492783e-07,
+      "loss": 0.5384,
+      "step": 6620
+    },
+    {
+      "epoch": 1.8839551709462334,
+      "grad_norm": 0.4737865924835205,
+      "learning_rate": 2.6290840319625255e-07,
+      "loss": 0.5355,
+      "step": 6640
+    },
+    {
+      "epoch": 1.889629734714144,
+      "grad_norm": 0.45271801948547363,
+      "learning_rate": 2.378314845314561e-07,
+      "loss": 0.5451,
+      "step": 6660
+    },
+    {
+      "epoch": 1.895304298482054,
+      "grad_norm": 0.46050384640693665,
+      "learning_rate": 2.14001670032124e-07,
+      "loss": 0.5347,
+      "step": 6680
+    },
+    {
+      "epoch": 1.9009788622499646,
+      "grad_norm": 0.4726841151714325,
+      "learning_rate": 1.9142097283479876e-07,
+      "loss": 0.5428,
+      "step": 6700
+    },
+    {
+      "epoch": 1.906653426017875,
+      "grad_norm": 0.4662003815174103,
+      "learning_rate": 1.700913005509208e-07,
+      "loss": 0.5407,
+      "step": 6720
+    },
+    {
+      "epoch": 1.9123279897857852,
+      "grad_norm": 0.44422999024391174,
+      "learning_rate": 1.500144551056709e-07,
+      "loss": 0.535,
+      "step": 6740
+    },
+    {
+      "epoch": 1.9180025535536955,
+      "grad_norm": 0.4599597752094269,
+      "learning_rate": 1.3119213258574015e-07,
+      "loss": 0.5376,
+      "step": 6760
+    },
+    {
+      "epoch": 1.9236771173216058,
+      "grad_norm": 0.4735456705093384,
+      "learning_rate": 1.1362592309605291e-07,
+      "loss": 0.5392,
+      "step": 6780
+    },
+    {
+      "epoch": 1.9293516810895164,
+      "grad_norm": 0.4692912995815277,
+      "learning_rate": 9.731731062542604e-08,
+      "loss": 0.5398,
+      "step": 6800
     }
   ],
+  "logging_steps": 20,
+  "max_steps": 7048,
   "num_input_tokens_seen": 0,
+  "num_train_epochs": 2,
   "save_steps": 200,
   "stateful_callbacks": {
     "TrainerControl": {
       "attributes": {}
     }
   },
+  "total_flos": 1.5124467391135325e+20,
+  "train_batch_size": 1,
   "trial_name": null,
   "trial_params": null
 }

training_args.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a6bf16ea130bda159d1af2ee62d236c7ae097ea41c8408d8221e7b326b65872b
-size 6456

 version https://git-lfs.github.com/spec/v1
+oid sha256:ffd93f25c50f75fbd7f7b6ad5a315acf357ca57e88203e0285f40efaac4f4e34
+size 6520