Commit 7b6a18e by bhenrym14 · 1 parent: a6e30d7

Update README.md

Files changed (1): README.md (+2 -2)
@@ -29,8 +29,8 @@ Otherwise for context <8k. Use exllama. Set `max_seq_len` to 16384, and `compres
 
  ## Motivation
  Recent advancements in extending context by RoPE scaling ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k) and [Meta AI](https://arxiv.org/abs/2306.15595)) demonstrate the ability to extend the context window without (total) retraining. My prior experiments have found the following:
- - An adapter finetuned with the scaled embeddings, applied to a base model other than the one upon which it was trained, brings a significant performance penalty at all context lengths (see 13b and 33b PI).
- - Pretraining on sequences equal in length to the maximum given by the scaling factor improves performance considerably. This is most notable at the longest context lengths. In fact, for the 7b model it was necessary to achieve decreasing perplexity beyond 8k tokens (see airoboros-7b-lctx-).
+ - An adapter finetuned with the scaled embeddings, applied to a base model other than the one upon which it was trained, brings a significant performance penalty at all context lengths (see [airoboros-13b-gpt4-1.4.1-PI-8192](https://huggingface.co/bhenrym14/airoboros-13b-gpt4-1.4.1-PI-8192-GPTQ)).
+ - Pretraining on sequences equal in length to the maximum given by the scaling factor improves performance considerably. This is most notable at the longest context lengths. In fact, for the 7b model it was necessary to achieve decreasing perplexity beyond 8k tokens (see [airoboros-7b-gpt4-1.4.1-lxctx-PI-16384](https://huggingface.co/bhenrym14/airoboros-7b-gpt4-1.4.1-lxctx-PI-16384-fp16)).
 
  This model applies the pretraining methodology at 8192 sequence length, but uses a scaling factor of 8, giving a theoretical max context of 16384. Unlike for the 7b model, I did not pretrain at 16384 due to memory constraints. How will this model perform at contexts >8k? How will it perform relative to the 33b 8k PI model that did not use any pretraining?
 