bhenrym14 committed
Commit b6bbdb4
1 Parent(s): 673abc7

Update README.md

Files changed (1)
  1. README.md +15 -3
README.md CHANGED
@@ -1,10 +1,22 @@
  ## Overview
 
- This is Jon Durbin's Airoboros 33B GPT4 1.4 with several key modifications:
- - Context length extended to 8192 by RoPE Scaled Embeddings
  - Training sequences beyond 2048 have the target truncated to equal 2048.
 
- I emulated all other training
  ## Overview
 
+ This is a GPTQ-quantized version of [Jon Durbin's Airoboros 33B GPT4 1.4](https://huggingface.co/jondurbin/airoboros-33b-gpt4-1.4) with several key modifications:
+ - Context length extended to 8192 by RoPE Scaled Embeddings, but NOT via the superHOT LoRA.
  - Training sequences beyond 2048 have the target truncated to equal 2048.
 
+ Otherwise, I emulated the training process as closely as possible. It was trained on 1x RTX 6000 Ada for ~43 hours.
+
+ ## Motivation
+ Recent advancements in extending context by RoPE scaling ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k) and [Meta AI](https://arxiv.org/abs/2306.15595)) demonstrate the ability to extend the context window without (total) retraining. Finetuning has been shown to be necessary to properly leverage the longer context. The superHOT LoRA is an adapter that has been finetuned on longer context (8192 tokens); even when applied to dissimilar models, it successfully extends the context window to which the model can attend. While it is impressive that this adapter is so flexible, how much does performance suffer relative to a model that has been finetuned with the scaled embeddings from the start? This model is an experiment to explore that question.
+
+ ## Relative Performance (perplexity)
+
+
+
+ ## Quantization:
+
+ The merged model was quantized with AutoGPTQ (bits = 4, group_size = 128, desc_act = True).
+
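The README change above mentions extending context to 8192 by RoPE scaled embeddings. As a rough illustration only, here is a minimal sketch of linear position scaling (position interpolation) in plain Python. It assumes a scale factor of 2048 / 8192 = 0.25, the common base of 10000, and the interleaved pairing of dimensions; none of these are stated in the commit, and the helper names `rope_angles` and `apply_rope` are hypothetical, not taken from the training code.

```python
import math

# Assumed values: 2048 is taken as the original context, 8192 as the target.
ORIGINAL_CONTEXT = 2048
EXTENDED_CONTEXT = 8192
SCALE = ORIGINAL_CONTEXT / EXTENDED_CONTEXT  # 0.25 (assumption, not from the commit)


def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Return one rotation angle per pair of dimensions.

    With scale < 1 the position index is compressed before the angle is
    computed, so positions up to 8191 map into the range seen at 2048.
    """
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    scaled_position = position * scale
    return [
        scaled_position * base ** (-2.0 * i / head_dim)
        for i in range(head_dim // 2)
    ]


def apply_rope(vec, position, base=10000.0, scale=1.0):
    """Rotate consecutive (even, odd) pairs of vec by the scaled angles."""
    rotated = []
    for i, angle in enumerate(rope_angles(position, len(vec), base, scale)):
        x, y = vec[2 * i], vec[2 * i + 1]
        rotated.append(x * math.cos(angle) - y * math.sin(angle))
        rotated.append(x * math.sin(angle) + y * math.cos(angle))
    return rotated


# Position 8188 with scaling gets the same angles as position 2047 without it.
scaled = rope_angles(8188, head_dim=8, scale=SCALE)
unscaled = rope_angles(2047, head_dim=8)
```

Under these assumptions, every position in the 8192-token window falls at a fractional position inside the original 2048-token range, which is why some finetuning on the scaled embeddings is still needed for the model to use them well.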