kaiokendev committed
Commit 67bf26a
1 Parent(s): 65084ac

Upload lora

README.md CHANGED
@@ -1,3 +1,34 @@
---
license: mit
---

### SuperHOT Prototype 2 w/ 4-8K Context

This is a second prototype of SuperHOT, this time with 4K context and no RLHF. In my testing, it can go all the way to 6K without breaking down, and since I made the change with the intention of reaching 8K, I expect it will also reach 8K even though I only trained on 4K sequences.

In order to use the 8K context, you will need to apply the monkeypatch I have added in this repo -- without it, it will not work. The patch is very simple, and you can make the changes yourself (a usage sketch follows the list):
- Increase `max_position_embeddings` to 8192 to stretch the sinusoidal encoding
- Stretch the frequency steps by a scale of `0.25`
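
As a concrete example, this is roughly how the patch would be applied before building the model and attaching this LoRA. It is a minimal sketch, not a script from this repo: the base model path and LoRA path are placeholders, and the use of `peft` to load the adapter is an assumption.

```python
# Minimal sketch: patch RoPE first, then load the base model and this LoRA.
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

from llama_rope_scaled_monkey_patch import replace_llama_rope_with_scaled_rope

# Swap the stock rotary embedding for the scaled one (8192 positions, 0.25 scale).
# This must happen before the model is instantiated, otherwise it has no effect.
replace_llama_rope_with_scaled_rope()

base_model_path = "path/to/llama-base"  # placeholder
lora_path = "path/to/this-lora"         # placeholder

tokenizer = LlamaTokenizer.from_pretrained(base_model_path)
model = LlamaForCausalLM.from_pretrained(base_model_path)
model = PeftModel.from_pretrained(model, lora_path)
```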

The intuition is to keep the model within the positions learned by the pre-trained model, since the model may be overfit on the token-position relationship (not my idea -- see [Ofir Press](https://ofir.io/)). By interpolating the encodings, we remain within the bounds of the pre-trained model (working with the overfitting rather than against it). The monkeypatch will work on the pre-trained model without fine-tuning, but the results will not be that good without it.
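
To make the interpolation concrete, here is a tiny sketch (not part of this repo) of how the `0.25` scale compresses 8K worth of positions back into the 0-2048 window the base model saw during pre-training:

```python
import torch

# Position indices the patched model will see at 8K context
positions = torch.arange(8192, dtype=torch.float32)

# The 0.25 scale maps them into [0, 2048), i.e. within the positions the
# base model was pre-trained on, before the rotary angles are computed.
scaled = positions * 0.25
print(scaled.min().item(), scaled.max().item())  # 0.0 2047.75
```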

It can probably be made even better with a few other modifications which I am testing (swapping softmax for ReLU, increasing the head dimension).

In my testing, I tried random positional encoding, but I was not able to replicate the results of [Jianlin Su](https://kexue.fm/archives/9444), so maybe I did it incorrectly. I also tried shifted positions, log n scaling, log-sigmoid, and increasing the head dimension, but this dilated RoPE (DoPE :) ) is the only one which worked for me consistently. Note that these are all based on fine-tuning, since the goal is to extend the context of the pre-trained model; pre-training will paint a different picture.

I trained the LoRA with the following configuration (sketched in code after the list):
- 1200 samples (~400 samples over 2048 sequence length)
- learning rate of 3e-4
- 3 epochs
- The exported modules are:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - all bias
- Rank = 2
- Alpha = 8
- no dropout
- weight decay of 0.1
- AdamW beta1 of 0.9 and beta2 0.99, epsilon of 1e-5
- Trained on 4-bit base model
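
The training script itself is not included in this repo, but a `peft`/`transformers` configuration matching the settings above would look roughly like this (a sketch; the 4-bit loading of the base model and the surrounding training loop are left out, and `output_dir` is a placeholder):

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings mirroring the list above (and adapter_config.json below)
lora_config = LoraConfig(
    r=2,
    lora_alpha=8,
    lora_dropout=0.0,
    bias="all",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Optimizer and schedule settings mirroring the list above
training_args = TrainingArguments(
    output_dir="superhot-lora",  # placeholder
    learning_rate=3e-4,
    num_train_epochs=3,
    weight_decay=0.1,
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-5,
)
```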
adapter_config.json ADDED
@@ -0,0 +1,19 @@
{
  "base_model_name_or_path": "",
  "bias": "all",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "lora_alpha": 8,
  "lora_dropout": 0,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 2,
  "target_modules": [
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj"
  ],
  "task_type": "CAUSAL_LM"
}
adapter_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:76133dc631ac8dc28341c45f8f469cc603174cb1d16c728f65b33778f8f497e4
size 17579562
llama_rope_scaled_monkey_patch.py ADDED
@@ -0,0 +1,63 @@
import torch
import transformers
import transformers.models.llama.modeling_llama


class ScaledRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float().to(device) / dim))
        self.register_buffer("inv_freq", inv_freq)

        # Stretch the positional range to 8K regardless of what the caller passes in.
        max_position_embeddings = 8192

        # Build here to make `torch.jit.trace` work.
        self.max_seq_len_cached = max_position_embeddings
        t = torch.arange(
            self.max_seq_len_cached,
            device=self.inv_freq.device,
            dtype=self.inv_freq.dtype,
        )

        # Dilate the positions by 0.25 so 8192 positions map into the 0-2048
        # range the base model was pre-trained on.
        self.scale = 1 / 4
        t *= self.scale

        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer(
            "cos_cached", emb.cos()[None, None, :, :], persistent=False
        )
        self.register_buffer(
            "sin_cached", emb.sin()[None, None, :, :], persistent=False
        )

    def forward(self, x, seq_len=None):
        # x: [bs, num_attention_heads, seq_len, head_size]
        # This `if` block is unlikely to be run after we build sin/cos in `__init__`. Keep the logic here just in case.
        if seq_len > self.max_seq_len_cached:
            self.max_seq_len_cached = seq_len
            t = torch.arange(
                self.max_seq_len_cached, device=x.device, dtype=self.inv_freq.dtype
            )
            # Apply the same dilation when the cache has to be rebuilt.
            t *= self.scale
            freqs = torch.einsum("i,j->ij", t, self.inv_freq)
            # Different from paper, but it uses a different permutation in order to obtain the same calculation
            emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
            self.register_buffer(
                "cos_cached", emb.cos()[None, None, :, :], persistent=False
            )
            self.register_buffer(
                "sin_cached", emb.sin()[None, None, :, :], persistent=False
            )
        return (
            self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
            self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
        )


def replace_llama_rope_with_scaled_rope():
    # Swap the stock rotary embedding for the scaled version; must be called
    # before the LLaMA model is instantiated.
    transformers.models.llama.modeling_llama.LlamaRotaryEmbedding = (
        ScaledRotaryEmbedding
    )