bhenrym14 committed on
Commit 444fb74
1 Parent(s): b6bbdb4

Update README.md

Files changed (1)
  1. README.md +11 -5
README.md CHANGED
@@ -1,3 +1,4 @@
 
 ## Overview
 
 This is [Jon Durbin's Airoboros 33B GPT4 1.4](https://huggingface.co/jondurbin/airoboros-33b-gpt4-1.4) (with GPTQ Quantization) with several key modifications:
@@ -7,20 +8,25 @@ This is [Jon Durbin's Airoboros 33B GPT4 1.4](https://huggingface.co/jondurbin/a
 Otherwise, I emulated the training process as closely as possible. It was trained on 1x RTX 6000 Ada for ~43 hours.
 
 ## Motivation
- Recent advancements in extending context by RoPE scaling ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k) and [(meta AI)](https://arxiv.org/abs/2306.15595)) demonstrate the ability to extend the context window without (total) retraining. Finetuning has shown to be necessary to properly leverage the longer context. The superHOT LoRA is a finetuned adapter that has been finetuned on longer context (8192 tokens); even when applied to dissimilar models, it successfully extends the contexts window to which the model can attend. While impressive this adapter is so flexible, how much does performance suffer relative to a model that has been finetuned with the scaled embeddings from the start? This is an experiment to explore this.
 
 ## Relative Performance (perplexity)
 
-
 
 ## Quantization:
 
- The merged model was quantized with AutoGPTQ (bits = 4, group_size = 128, desc_act = True)
-
 
 
 
- ## Original model card: Jon Durbin's Airoboros 33B GPT4 1.4
 
 
 __not yet tested!__
 
+ # RoPE Scaled Finetune of airoboros-33b-gpt4-1.4.1 (GPTQ)
 ## Overview
 
 This is [Jon Durbin's Airoboros 33B GPT4 1.4](https://huggingface.co/jondurbin/airoboros-33b-gpt4-1.4) (with GPTQ Quantization) with several key modifications:
 
 Otherwise, I emulated the training process as closely as possible. It was trained on 1x RTX 6000 Ada for ~43 hours.
 
 ## Motivation
+ Recent advancements in extending context by RoPE scaling ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k) and [Meta AI](https://arxiv.org/abs/2306.15595)) demonstrate the ability to extend the context window without (total) retraining. Finetuning has been shown to be necessary to properly leverage the longer context. The SuperHOT LoRA is an adapter finetuned on a longer context (8192 tokens); even when applied to dissimilar models, it successfully extends the context window to which the model can attend. While it is impressive that this adapter is so flexible, how much does performance suffer relative to a model finetuned with the scaled embeddings from the start? This is an experiment to explore that question.
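
For readers unfamiliar with the technique, position interpolation (the "PI" in the model name) linearly compresses the rotary position indices so that 8192 positions fall within the 0–2047 range the base model was pretrained on. Below is a minimal PyTorch sketch of that idea; the class name and the hard-coded scale factor of 4 (2048 → 8192) are illustrative assumptions, not code from this repo.

```python
import torch


class ScaledRotaryEmbedding(torch.nn.Module):
    """Hypothetical rotary embedding with linear position interpolation.

    With scale=4.0, positions 0..8191 are mapped into the 0..2047 range the
    base model was pretrained on (2048 * 4 = 8192). Illustrative sketch only,
    not the module used for this finetune.
    """

    def __init__(self, dim: int, base: float = 10000.0, scale: float = 4.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)
        self.scale = scale

    def forward(self, seq_len: int, device=None):
        # The key step: divide the position index by the scale factor so the
        # sin/cos tables never leave the range seen during pretraining.
        t = torch.arange(seq_len, device=device, dtype=self.inv_freq.dtype) / self.scale
        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos(), emb.sin()
```

The same scale factor has to be applied at inference as during finetuning; otherwise the positions the model sees no longer match those it was trained on.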
 
 ## Relative Performance (perplexity)
+ | Model | Context | Perplexity |
+ | ---------------------------------------------------- | ----------- | ---------- |
+ | TheBloke/airoboros-33B-gpt4-1-4-SuperHOT-8K-GPTQ | 2048 | 5.15 |
+ | TheBloke/airoboros-33B-gpt4-1-4-SuperHOT-8K-GPTQ | 8192 | 5.04 |
+ | **bhenrym14/airoboros-33b-gpt4-1.4.1-PI-8192-GPTQ** | **2048** | **4.32** |
+ | **bhenrym14/airoboros-33b-gpt4-1.4.1-PI-8192-GPTQ** | **3072** | **4.26** |
 
+ How does this reduction in perplexity translate into actual performance lift on downstream tasks? I'm not sure yet.
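
For context on how such numbers are typically produced: perplexity at a fixed context length is the exponential of the mean token-level cross-entropy over windows of that length drawn from held-out text. The sketch below is a generic recipe using `transformers`, with placeholder model/tokenizer IDs and corpus; it is not necessarily the evaluation that produced the table above.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def chunked_perplexity(model, tokenizer, text: str, context_len: int) -> float:
    """Perplexity = exp(mean cross-entropy) over non-overlapping windows.

    `text` must be longer than `context_len` tokens; the trailing partial
    window is dropped for simplicity.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    losses = []
    for start in range(0, ids.numel() - context_len + 1, context_len):
        window = ids[start : start + context_len].unsqueeze(0).to(model.device)
        with torch.no_grad():
            out = model(input_ids=window, labels=window)  # HF shifts labels internally
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))


# Placeholders only; loading the GPTQ checkpoints from the table requires a
# GPTQ-aware loader (e.g. AutoGPTQ) rather than a plain from_pretrained call.
# model = AutoModelForCausalLM.from_pretrained("<model-id>")
# tokenizer = AutoTokenizer.from_pretrained("<model-id>")
# print(chunked_perplexity(model, tokenizer, open("<heldout.txt>").read(), 2048))
```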
 
 ## Quantization:
 
+ The merged model was quantized with AutoGPTQ (bits = 4, group_size = 128, desc_act = True). If there's interest, I can upload the LoRA weights and/or the merged 16-bit HF model.
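
For concreteness, those settings correspond directly to AutoGPTQ's `BaseQuantizeConfig`. The following is a minimal sketch of such a quantization run; the paths and calibration text are placeholders, and this is not the exact script used for this model.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

merged_model_dir = "path/to/merged-16bit-model"  # placeholder path
quantized_dir = "path/to/output-GPTQ"            # placeholder path

# Settings named above: 4-bit, group size 128, act-order (desc_act)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=True)

tokenizer = AutoTokenizer.from_pretrained(merged_model_dir)
model = AutoGPTQForCausalLM.from_pretrained(merged_model_dir, quantize_config)

# A tiny calibration set for illustration; real runs use a larger,
# representative sample of text.
examples = [tokenizer("Placeholder calibration text for GPTQ quantization.")]

model.quantize(examples)
model.save_quantized(quantized_dir, use_safetensors=True)
tokenizer.save_pretrained(quantized_dir)
```

desc_act = True (act-order) is generally chosen for better quantization accuracy at a given group size, at some cost in quantization/inference speed with certain kernels.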
 
 
 
 
+ # Original model card: Jon Durbin's Airoboros 33B GPT4 1.4
 
 
 __not yet tested!__