---
base_model: Sao10K/L3-8B-Stheno-v3.3-32K
quantized_by: Lewdiculous
library_name: transformers
license: cc-by-nc-4.0
inference: false
language:
- en
tags:
- roleplay
- llama3
- sillytavern
---
# #roleplay #sillytavern #llama3
My GGUF-IQ-Imatrix quants for [**Sao10K/L3-8B-Stheno-v3.3-32K**](https://huggingface.co/Sao10K/L3-8B-Stheno-v3.3-32K).
**Sao10K** with Stheno **yet** again, now bigger and better than ever! <br>
I recommend checking his page for feedback and support.
> [!IMPORTANT]
> **Quantization process:** <br>
> Imatrix data was generated from the FP16-GGUF, and the quants were converted directly from the BF16-GGUF. <br>
> This is a bit more disk and compute intensive but hopefully avoids any losses during conversion. <br>
> To run this model, please use the [**latest version of KoboldCpp**](https://github.com/LostRuins/koboldcpp/releases/latest). <br>
> If you notice any issues, let me know in the discussions.
> [!NOTE]
> **General usage:** <br>
> For **8GB VRAM** GPUs, I recommend the **Q4_K_M-imat** (4.89 BPW) quant for context sizes up to 12288. <br>
>
> **Presets:** <br>
> Some compatible SillyTavern presets can be found [**here (Virt's Roleplay Presets)**](https://huggingface.co/Virt-io/SillyTavern-Presets). <br>
> Check [**discussions such as this one**](https://huggingface.co/Virt-io/SillyTavern-Presets/discussions/5#664d6fb87c563d4d95151baa) for other recommendations and samplers.
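
As a rough sanity check on the 8GB recommendation above, the sketch below adds the quantized weight size to an fp16 KV cache. The parameter count and architecture numbers (32 layers, 8 KV heads, head dimension 128) are assumptions about a Llama 3 8B model, not values stated in this card, and compute buffers and runtime overhead are ignored.

```python
# Back-of-the-envelope VRAM estimate. Every constant except BPW is an
# assumption about a Llama 3 8B model, not a value taken from this card.
N_PARAMS = 8.03e9   # assumed parameter count
BPW = 4.89          # bits per weight quoted for Q4_K_M-imat
N_LAYERS = 32       # assumed number of transformer layers
N_KV_HEADS = 8      # assumed number of grouped-query KV heads
HEAD_DIM = 128      # assumed per-head dimension
KV_BYTES = 2        # assumed fp16 KV cache (2 bytes per value)
GIB = 1024 ** 3


def weights_bytes(n_params=N_PARAMS, bpw=BPW):
    """Size of the quantized weights in bytes."""
    return n_params * bpw / 8


def kv_cache_bytes(context_len):
    """Keys plus values for every layer, KV head, and token in the context."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_BYTES * context_len


def estimate_gib(context_len):
    """Weights plus KV cache in GiB; excludes compute buffers and overhead."""
    return (weights_bytes() + kv_cache_bytes(context_len)) / GIB


for ctx in (8192, 12288, 32768):
    print(f"{ctx:>6} ctx: ~{estimate_gib(ctx):.2f} GiB")
```

Under these assumptions the weights plus cache come to roughly 6 GiB at 12288 context, which leaves some headroom on an 8GB card, while the full 32K context would not fit without a smaller quant or a quantized cache.
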
<details>
<summary>⇲ Click here to expand/hide information – General chart with relative quant performances.</summary>
> [!NOTE]
> **Recommended read:** <br>
>
> [**"Which GGUF is right for me? (Opinionated)" by Artefact2**](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9)
>
> *Click the image to view full size.*
> !["Which GGUF is right for me? (Opinionated)" by Artefact2 - First Graph](https://cdn-uploads.huggingface.co/production/uploads/65d4cf2693a0a3744a27536c/fScWdHIPix5IzNJ8yswCB.webp)
</details>
> [!TIP]
> **Personal-support:** <br>
> I apologize for disrupting your experience. <br>
> Eventually I may be able to use a dedicated server for this, but for now hopefully these quants are helpful. <br>
> If you **want** and you are **able to**... <br>
> You can [**spare some change over here (Ko-fi)**](https://ko-fi.com/Lewdiculous). <br>
>
> **Author-support:** <br>
> You can support the author [**at their own page**](https://ko-fi.com/sao10k).
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65d4cf2693a0a3744a27536c/1wb5-yFyvxWQSWBMlB36x.png)
<details>
<summary>Original model card information.</summary>
## **Original card:**
Trained with compute from [Backyard.ai](https://backyard.ai/) | Thanks to them and @dynafire for helping me out.
---
Training Details:
<br>Trained at 8K Context -> Expanded to 32K Context with PoSE training.
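
For readers unfamiliar with PoSE (Positional Skip-wisE training), here is a minimal sketch of the idea: the model still sees only 8K tokens per sample, but their position ids are shifted by random skips so that they cover the 32K target range. This is a simplified illustration, not Axolotl's implementation; the function name, the two-chunk split, and the equal chunk sizes are assumptions made for the example.

```python
import random


def pose_position_ids(train_len, target_len, n_chunks=2, rng=None):
    """Return train_len position ids spread across [0, target_len).

    Hypothetical helper: the training window is split into contiguous
    chunks, and each chunk after the first is shifted by a random,
    non-decreasing skip so that ids stay strictly increasing.
    """
    rng = rng or random.Random()
    base = train_len // n_chunks
    sizes = [base] * n_chunks
    sizes[-1] += train_len - base * n_chunks  # last chunk takes the remainder

    max_skip = target_len - train_len
    # First chunk is never shifted; later skips are sorted so order is kept.
    skips = [0] + sorted(rng.randint(0, max_skip) for _ in range(n_chunks - 1))

    ids = []
    start = 0
    for size, skip in zip(sizes, skips):
        ids.extend(range(start + skip, start + skip + size))
        start += size
    return ids


ids = pose_position_ids(8192, 32768, rng=random.Random(0))
print(len(ids), ids[0], ids[-1])
```

Each sample still costs 8K tokens of compute, but across many samples the model is exposed to relative distances anywhere within the 32K window.
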
Dataset Modifications:
<br>\- Further Cleaned up Roleplaying Samples -> Quality Check
<br>\- Removed Low Quality Samples from Manual Check -> Increased Baseline Quality Floor
<br>\- More Creative Writing Samples -> 2x Samples
<br>\- Remade and Refined Detailed Instruct Data
Notes:
<br>\- Training run is much less aggressive than previous Stheno versions.
<br>\- This model works when tested in bf16 with the same configs as within the file.
<br>\- I do not know the effects quantisation has on it.
<br>\- Roleplays pretty well. Feels nice in my opinion.
<br>\- It has some issues with long-context understanding and reasoning. Still much better than plain rope scaling, though, so that is a plus.
<br>\- Reminder, this isn't a native 32K model. It has its issues, but it's coherent and working well.
Sanity Check // Needle in a Haystack Results:
<br>\- This is not as complex as RULER or NIAN, but it's a basic evaluator. Some improper train examples had Haystack scores ranging from Red to Orange for most of the extended contexts.
![Results](https://huggingface.co/Sao10K/L3-8B-Stheno-v3.3-32K/resolve/main/haystack.png)
Wandb Run:
![Wandb](https://huggingface.co/Sao10K/L3-8B-Stheno-v3.3-32K/resolve/main/wandb.png)
---
Relevant Axolotl Configurations:
<br>-> Taken from [winglian/Llama-3-8b-64k-PoSE](https://huggingface.co/winglian/Llama-3-8b-64k-PoSE)
<br>\- I tried to find my own configs and spent hours tinkering, but the one he used worked best, so I stuck with it.
<br>\- 2M Rope Theta had the best loss results during training compared to other values.
<br>\- Leaving it at 500K rope wasn't that much worse, but 4M and 8M Theta made the grad_norm values worsen even if loss drops fast.
<br>\- Mixing in Pretraining Data was a PITA. Made it a lot worse with formatting.
<br>\- Pretraining / Noise made it worse at Haystack too? It wasn't all Green, Mainly Oranges.
<br>\- Improper / Bad Rope Theta shows in Grad_Norm exploding to thousands. It'll drop to low values alright, but it's a scary fast drop even with gradient clipping.
```
sequence_len: 8192
use_pose: true
pose_max_context_len: 32768

overrides_of_model_config:
  rope_theta: 2000000.0
  max_position_embeddings: 32768

# peft_use_dora: true
adapter: lora
peft_use_rslora: true
lora_model_dir:
lora_r: 256
lora_alpha: 256
lora_dropout: 0.1
lora_target_linear: true
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

warmup_steps: 80
gradient_accumulation_steps: 6
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine_with_min_lr
learning_rate: 0.00004
lr_scheduler_kwargs:
  min_lr: 0.000004
```
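
To give a feel for what the `rope_theta` values discussed above change, the sketch below computes the standard RoPE inverse frequencies and the longest rotation wavelength for each theta that was tried. A head dimension of 128 is an assumption about the base model, not something stated in the config, and this only illustrates how theta stretches the wavelengths; it does not explain why 2M gave the best loss.

```python
import math

HEAD_DIM = 128  # assumed per-head dimension of the base model


def rope_inv_freq(theta, head_dim=HEAD_DIM):
    """Standard RoPE inverse frequencies: theta ** (-2i / d) for each pair i."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(theta, head_dim=HEAD_DIM):
    """Tokens needed for the slowest-rotating pair to complete one full turn."""
    return 2 * math.pi / rope_inv_freq(theta, head_dim)[-1]


for theta in (500_000.0, 2_000_000.0, 4_000_000.0, 8_000_000.0):
    print(f"theta={theta:>11,.0f}  longest wavelength ~ {longest_wavelength(theta):,.0f} tokens")
```

Raising theta slows every rotation, so positions that are far apart in a 32K window end up with more distinguishable phase differences in the low-frequency dimensions.
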
</details>