Merry committed on
Commit a695a4c · 1 Parent(s): 865fccc

Cleaning up the README some more

Files changed (1)
  1. README.md +9 -283
README.md CHANGED
@@ -35,10 +35,6 @@ ggmlv3-pythia-2.8b-deduped-q5_1.bin | 2.3 GiB
35
 
36
  *Tested on KoboldCpp with OpenBLAS enabled.*
37
 
38
- **Notes:**
39
- - The models have been converted with ggerganov/ggml's gpt-neox conversion script, and tested only on KoboldCpp. Other frontends that support GGML-based conversions of GPT-NeoX *should* work, but I can't promise anything.
40
- - They're sorted by date based on when they were made so it was easier to track breaking changes. If you're just starting off I highly recommend the latest (which is 2023-05-25). Combined with KoboldCpp v1.25.1+ this improved the tokenizer, which in my testing reduces occurrences of broken words like "Alicae" or "Reimu Hai-ku-rei".
41
-
42
  **Versions:**
43
 
44
  **2023-04-20:** *q4_3. Used [commit 05f3079](https://github.com/ggerganov/ggml/tree/05f307971862b83df12fada0c42ee027ba5a82b5/examples/stablelm)*
@@ -51,284 +47,14 @@ ggmlv3-pythia-2.8b-deduped-q5_1.bin | 2.3 GiB
51
 
52
  **2023-05-25:** *New quantization format (ggmlv3). q4_0 and q5_1, up to 2.8B. Used [commit 73ad593](https://github.com/ggerganov/ggml/tree/73ad593cf84f864f0fcfd3a196253575c70d66a2/examples/gpt-neox)*
53
 
54
- They're separated by date and commit so it's easier to keep track of any breaking changes.
 
 
55
 
56
  # ALTERNATIVES
57
- If you're here because you want a smaller model to run on a device with constrained memory, consider the following:
58
- - OpenLLaMA [3B](https://huggingface.co/openlm-research/open_llama_3b_350bt_preview) [(7B)](https://huggingface.co/openlm-research/open_llama_7b_400bt_preview)
59
- - RedPajama-INCITE [(3B)](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1) [(7B)](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-7B-v0.1)
60
- - MPT [(1B)](https://huggingface.co/mosaicml/mpt-1b-redpajama-200b) [(7B)](https://huggingface.co/mosaicml/mpt-7b).
61
- - RWKV PilePlus [(169M) (430M) (1.5B) (3B)](https://huggingface.co/BlinkDL/rwkv-4-pileplus)
62
-
63
- All of them are trained at least partially on an open reproduction of LLaMA's dataset, [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T), but they're based on different architectures. OpenLLaMA is based on the LLaMA architecture (making it compatible with llama.cpp), RedPajama-INCITE is based on GPT-NeoX, and MPT and RWKV use their own.
64
-
65
- Below is the original model card for Pythia 1.4B Deduped.
66
-
67
- * * *
68
-
69
- The *Pythia Scaling Suite* is a collection of models developed to facilitate
70
- interpretability research. It contains two sets of eight models of sizes
71
- 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two
72
- models: one trained on the Pile, and one trained on the Pile after the dataset
73
- has been globally deduplicated. All 8 model sizes are trained on the exact
74
- same data, in the exact same order. We also provide 154 intermediate
75
- checkpoints per model, hosted on Hugging Face as branches.
76
-
77
- The Pythia model suite was designed to promote scientific
78
- research on large language models, especially interpretability research.
79
- Despite not centering downstream performance as a design goal, we find the
80
- models <a href="#evaluations">match or exceed</a> the performance of
81
- similar and same-sized models, such as those in the OPT and GPT-Neo suites.
82
-
83
- <details>
84
- <summary style="font-weight:600">Details on previous early release and naming convention.</summary>
85
-
86
- Previously, we released an early version of the Pythia suite to the public.
87
- However, we decided to retrain the model suite to address a few hyperparameter
88
- discrepancies. This model card <a href="#changelog">lists the changes</a>;
89
- see appendix B in the Pythia paper for further discussion. We found no
90
- difference in benchmark performance between the two Pythia versions.
91
- The old models are
92
- [still available](https://huggingface.co/models?other=pythia_v0), but we
93
- suggest the retrained suite if you are just starting to use Pythia.<br>
94
- **This is the current release.**
95
-
96
- Please note that all models in the *Pythia* suite were renamed in January
97
- 2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
98
- comparing the old and new names</a> is provided in this model card, together
99
- with exact parameter counts.
100
- </details>
101
- <br>
102
-
103
- # Pythia-1.4B-deduped
104
-
105
- ## Model Details
106
-
107
- - Developed by: [EleutherAI](http://eleuther.ai)
108
- - Model type: Transformer-based Language Model
109
- - Language: English
110
- - Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia)
111
- for training procedure, config files, and details on how to use.
112
- - Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
113
- - License: Apache 2.0
114
- - Contact: to ask questions about this model, join the [EleutherAI
115
- Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
116
- Please read the existing *Pythia* documentation before asking about it in the
117
- EleutherAI Discord. For general correspondence: [contact@eleuther.
118
- ai](mailto:contact@eleuther.ai).
119
-
120
- <figure>
121
-
122
- | Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate | Equivalent Models |
123
- | -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
124
- | 70M | 18,915,328 | 6 | 512 | 8 | 2M | 1.0 x 10<sup>-3</sup> | — |
125
- | 160M | 85,056,000 | 12 | 768 | 12 | 4M | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
126
- | 410M | 302,311,424 | 24 | 1024 | 16 | 4M | 3.0 x 10<sup>-4</sup> | OPT-350M |
127
- | 1.0B | 805,736,448 | 16 | 2048 | 8 | 2M | 3.0 x 10<sup>-4</sup> | — |
128
- | 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 4M | 2.0 x 10<sup>-4</sup> | GPT-Neo 1.3B, OPT-1.3B |
129
- | 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | 1.6 x 10<sup>-4</sup> | GPT-Neo 2.7B, OPT-2.7B |
130
- | 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | 1.2 x 10<sup>-4</sup> | OPT-6.7B |
131
- | 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | 1.2 x 10<sup>-4</sup> | — |
132
- <figcaption>Engineering details for the <i>Pythia Suite</i>. Deduped and
133
- non-deduped models of a given size have the same hyperparameters. “Equivalent”
134
- models have <b>exactly</b> the same architecture, and the same number of
135
- non-embedding parameters.</figcaption>
136
- </figure>
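For reference, the 1.4B row of the table above maps onto a Hugging Face `GPTNeoXConfig` roughly as sketched below. The layer count, model dimension, and head count come from the table; the vocabulary size, feed-forward width, context length, and rotary settings are my assumptions, not values stated in this card, so check EleutherAI's config files for the authoritative numbers.

```python
# Rough sketch only: a GPTNeoXConfig mirroring the 1.4B row of the table above.
# Values marked "assumed" are not taken from this card.
from transformers import GPTNeoXConfig

config_1_4b = GPTNeoXConfig(
    num_hidden_layers=24,          # "Layers" column
    hidden_size=2048,              # "Model Dim" column
    num_attention_heads=16,        # "Heads" column
    intermediate_size=8192,        # assumed 4 * hidden_size
    vocab_size=50304,              # assumed padded GPT-NeoX-20B vocabulary
    max_position_embeddings=2048,  # assumed context length
    rotary_pct=0.25,               # assumed rotary fraction
    use_parallel_residual=True,    # assumed, as in GPT-NeoX
)
print(config_1_4b)
```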
137
-
138
- ## Uses and Limitations
139
-
140
- ### Intended Use
141
-
142
- The primary intended use of Pythia is research on the behavior, functionality,
143
- and limitations of large language models. This suite is intended to provide
144
- a controlled setting for performing scientific experiments. We also provide
145
- 154 checkpoints per model: initial `step0`, 10 log-spaced checkpoints
146
- `step{1,2,4...512}`, and 143 evenly-spaced checkpoints from `step1000` to
147
- `step143000`. These checkpoints are hosted on Hugging Face as branches. Note
148
- that branch `143000` corresponds exactly to the model checkpoint on the `main`
149
- branch of each model.
150
-
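As a concrete illustration, the 154 checkpoint branch names described above can be enumerated directly from that description. This is just a sketch of the naming scheme, not an API provided by the repository:

```python
# Sketch of the checkpoint branch names described above: step0, ten log-spaced
# early checkpoints step{1,2,4,...,512}, and 143 evenly spaced checkpoints
# step1000 ... step143000 (step143000 matches the `main` branch).
early = [0] + [2**i for i in range(10)]          # 0, 1, 2, 4, ..., 512
evenly_spaced = list(range(1000, 143001, 1000))  # 1000, 2000, ..., 143000
branches = [f"step{s}" for s in early + evenly_spaced]

assert len(branches) == 154
print(branches[:5], "...", branches[-1])
```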
151
- You may also further fine-tune and adapt Pythia-1.4B-deduped for deployment,
152
- as long as your use is in accordance with the Apache 2.0 license. Pythia
153
- models work with the Hugging Face [Transformers
154
- Library](https://huggingface.co/docs/transformers/index). If you decide to use
155
- pre-trained Pythia-1.4B-deduped as a basis for your fine-tuned model, please
156
- conduct your own risk and bias assessment.
157
-
158
- ### Out-of-scope use
159
-
160
- The Pythia Suite is **not** intended for deployment. It is not in itself
161
- a product and cannot be used for human-facing interactions. For example,
162
- the model may generate harmful or offensive text. Please evaluate the risks
163
- associated with your particular use case.
164
-
165
- Pythia models are English-language only, and are not suitable for translation
166
- or generating text in other languages.
167
-
168
- Pythia-1.4B-deduped has not been fine-tuned for downstream contexts in which
169
- language models are commonly deployed, such as writing genre prose,
170
- or commercial chatbots. This means Pythia-1.4B-deduped will **not**
171
- respond to a given prompt the way a product like ChatGPT does. This is because,
172
- unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement
173
- Learning from Human Feedback (RLHF) to better “follow” human instructions.
174
-
175
- ### Limitations and biases
176
-
177
- The core functionality of a large language model is to take a string of text
178
- and predict the next token. The token the model deems most likely need not produce the
179
- most “accurate” text. Never rely on Pythia-1.4B-deduped to produce factually accurate
180
- output.
181
-
182
- This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
183
- known to contain profanity and texts that are lewd or otherwise offensive.
184
- See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
185
- discussion of documented biases with regards to gender, religion, and race.
186
- Pythia-1.4B-deduped may produce socially unacceptable or undesirable text, *even if*
187
- the prompt itself does not include anything explicitly offensive.
188
-
189
- If you plan on using text generated through, for example, the Hosted Inference
190
- API, we recommend having a human curate the outputs of this language model
191
- before presenting them to other people. Please inform your audience that the
192
- text was generated by Pythia-1.4B-deduped.
193
-
194
- ### Quickstart
195
-
196
- Pythia models can be loaded and used via the following code, demonstrated here
197
- for the third `pythia-70m-deduped` checkpoint:
198
-
199
- ```python
200
- from transformers import GPTNeoXForCausalLM, AutoTokenizer
201
-
202
- model = GPTNeoXForCausalLM.from_pretrained(
203
- "EleutherAI/pythia-70m-deduped",
204
- revision="step3000",
205
- cache_dir="./pythia-70m-deduped/step3000",
206
- )
207
-
208
- tokenizer = AutoTokenizer.from_pretrained(
209
- "EleutherAI/pythia-70m-deduped",
210
- revision="step3000",
211
- cache_dir="./pythia-70m-deduped/step3000",
212
- )
213
-
214
- inputs = tokenizer("Hello, I am", return_tensors="pt")
215
- tokens = model.generate(**inputs)
216
- tokenizer.decode(tokens[0])
217
- ```
218
-
219
- Revision/branch `step143000` corresponds exactly to the model checkpoint on
220
- the `main` branch of each model.<br>
221
- For more information on how to use all Pythia models, see [documentation on
222
- GitHub](https://github.com/EleutherAI/pythia).
223
-
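Since every checkpoint is just a branch, the same loading pattern extends to comparing a prompt across training stages. The snippet below is a sketch along those lines (the revision names follow the scheme above; the generation settings are arbitrary), not an official recipe from the Pythia repository:

```python
# Sketch: compare completions from a few intermediate checkpoints of the same
# model. Each revision is a branch on the Hugging Face repo, as described above.
from transformers import GPTNeoXForCausalLM, AutoTokenizer

repo = "EleutherAI/pythia-70m-deduped"
tokenizer = AutoTokenizer.from_pretrained(repo)
prompt = tokenizer("Hello, I am", return_tensors="pt")

for revision in ["step1000", "step64000", "step143000"]:
    model = GPTNeoXForCausalLM.from_pretrained(repo, revision=revision)
    tokens = model.generate(**prompt, max_new_tokens=20)
    print(revision, tokenizer.decode(tokens[0]))
```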
224
- ## Training
225
-
226
- ### Training data
227
-
228
- Pythia-1.4B-deduped was trained on the Pile **after the dataset has been globally
229
- deduplicated**.<br>
230
- [The Pile](https://pile.eleuther.ai/) is a 825GiB general-purpose dataset in
231
- English. It was created by EleutherAI specifically for training large language
232
- models. It contains texts from 22 diverse sources, roughly broken down into
233
- five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
234
- prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
235
- miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
236
- paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
237
- methodology, and a discussion of ethical implications. Consult [the
238
- datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
239
- about the Pile and its component datasets. The Pile can be downloaded from
240
- the [official website](https://pile.eleuther.ai/), or from a [community
241
- mirror](https://the-eye.eu/public/AI/pile/).
242
-
243
- ### Training procedure
244
-
245
- All models were trained on the exact same data, in the exact same order. Each
246
- model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
247
- model are saved every 2,097,152,000 tokens, spaced evenly throughout training,
248
- from `step1000` to `step143000` (which is the same as `main`). In addition, we
249
- also provide frequent early checkpoints: `step0` and `step{1,2,4...512}`.
250
- This corresponds to training for just under 1 epoch on the Pile for
251
- non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.
252
-
253
- All *Pythia* models trained for 143000 steps at a batch size
254
- of 2M (2,097,152 tokens).<br>
255
- See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
256
- procedure, including [how to reproduce
257
- it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
258
- Pythia uses the same tokenizer as [GPT-NeoX-
259
- 20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
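The token counts quoted above are internally consistent, which a quick back-of-the-envelope check makes visible:

```python
# Quick consistency check of the numbers quoted above.
tokens_per_step = 2_097_152        # batch size of 2M tokens
total_steps = 143_000
checkpoint_interval_steps = 1_000

print(total_steps * tokens_per_step)                # 299,892,736,000 tokens seen
print(checkpoint_interval_steps * tokens_per_step)  # 2,097,152,000 tokens between checkpoints
print(total_steps // checkpoint_interval_steps)     # 143 evenly spaced checkpoints
```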
260
-
261
- ## Evaluations
262
-
263
- All 16 *Pythia* models were evaluated using the [LM Evaluation
264
- Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
265
- the results by model and step at `results/json/*` in the [GitHub
266
- repository](https://github.com/EleutherAI/pythia/tree/main/results/json/).<br>
267
- Expand the sections below to see plots of evaluation results for all
268
- Pythia and Pythia-deduped models compared with OPT and BLOOM.
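If you want the raw numbers rather than the plots, the per-step JSON files in that directory can be read with a few lines of Python. This is only a sketch: the exact file layout under `results/json/` and the metric keys inside each file follow the LM Evaluation Harness's output format and may differ from what is assumed here.

```python
# Sketch of reading one evaluation result file from the Pythia repository.
# Assumes the usual LM Evaluation Harness layout with a top-level "results"
# mapping of task name -> metrics; the path below is hypothetical, so point it
# at an actual file under results/json/ in https://github.com/EleutherAI/pythia.
import json

with open("results/json/pythia-1.4b-deduped/step143000.json") as f:  # hypothetical path
    data = json.load(f)

for task, metrics in data["results"].items():
    print(task, metrics)
```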
269
-
270
- <details>
271
- <summary>LAMBADA – OpenAI</summary>
272
- <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/lambada_openai_v1.png" style="width:auto"/>
273
- </details>
274
-
275
- <details>
276
- <summary>Physical Interaction: Question Answering (PIQA)</summary>
277
- <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/piqa_v1.png" style="width:auto"/>
278
- </details>
279
-
280
- <details>
281
- <summary>WinoGrande</summary>
282
- <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/winogrande_v1.png" style="width:auto"/>
283
- </details>
284
-
285
- <details>
286
- <summary>AI2 Reasoning Challenge—Easy Set</summary>
287
- <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/arc_easy_v1.png" style="width:auto"/>
288
- </details>
289
-
290
- <details>
291
- <summary>SciQ</summary>
292
- <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/sciq_v1.png" style="width:auto"/>
293
- </details>
294
-
295
- ## Changelog
296
-
297
- This section compares differences between previously released
298
- [Pythia v0](https://huggingface.co/models?other=pythia_v0) and the current
299
- models. See Appendix B of the Pythia paper for further discussion of these
300
- changes and the motivation behind them. We found that retraining Pythia had no
301
- impact on benchmark performance.
302
-
303
- - All model sizes are now trained with uniform batch size of 2M tokens.
304
- Previously, the models of size 160M, 410M, and 1.4B parameters were trained
305
- with batch sizes of 4M tokens.
306
- - We added checkpoints at initialization (step 0) and steps {1,2,4,8,16,32,64,
307
- 128,256,512} in addition to every 1000 training steps.
308
- - Flash Attention was used in the new retrained suite.
309
- - We remedied a minor inconsistency that existed in the original suite: all
310
- models of size 2.8B parameters or smaller had a learning rate (LR) schedule
311
- which decayed to a minimum LR of 10% of the starting LR, but the 6.9B and
312
- 12B models all used an LR schedule which decayed to a minimum LR of 0. In
313
- the redone training runs, we rectified this inconsistency: all models now were
314
- trained with LR decaying to a minimum of 0.1× their maximum LR.
315
-
316
- ### Naming convention and parameter count
317
-
318
- *Pythia* models were renamed in January 2023. It is possible that the old
319
- naming convention still persists in some documentation by accident. The
320
- current naming convention (70M, 160M, etc.) is based on total parameter count.
321
-
322
- <figure style="width:32em">
323
-
324
- | current Pythia suffix | old suffix | total params | non-embedding params |
325
- | --------------------: | ---------: | -------------: | -------------------: |
326
- | 70M | 19M | 70,426,624 | 18,915,328 |
327
- | 160M | 125M | 162,322,944 | 85,056,000 |
328
- | 410M | 350M | 405,334,016 | 302,311,424 |
329
- | 1B | 800M | 1,011,781,632 | 805,736,448 |
330
- | 1.4B | 1.3B | 1,414,647,808 | 1,208,602,624 |
331
- | 2.8B | 2.7B | 2,775,208,960 | 2,517,652,480 |
332
- | 6.9B | 6.7B | 6,857,302,016 | 6,444,163,072 |
333
- | 12B | 13B | 11,846,072,320 | 11,327,027,200 |
334
- </figure>
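The gap between the two parameter columns is just the (untied) embedding and unembedding matrices. For the 1.4B model, assuming a padded vocabulary of 50,304 tokens (my assumption, not a number given in this card) and the 2048 model dimension from the table, the arithmetic works out exactly:

```python
# Sanity check: total params - non-embedding params = input embedding + output
# unembedding. Vocabulary size 50,304 is assumed (padded GPT-NeoX-20B vocab);
# the padding differs for some larger sizes, so this check is for 1.4B only.
vocab, d_model = 50_304, 2_048
embedding_params = 2 * vocab * d_model   # untied embed + unembed
print(embedding_params)                  # 206,045,184
print(1_414_647_808 - 1_208_602_624)     # 206,045,184 from the table
```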
 
 
47
 
48
  **2023-05-25:** *New quantization format (ggmlv3). q4_0 and q5_1, up to 2.8B. Used [commit 73ad593](https://github.com/ggerganov/ggml/tree/73ad593cf84f864f0fcfd3a196253575c70d66a2/examples/gpt-neox)*
49
 
50
+ **Notes:**
51
+ - The models have been converted with ggerganov/ggml's gpt-neox conversion script, and tested only on KoboldCpp. Other frontends that support GGML-based conversions of GPT-NeoX *should* work, but I can't promise anything.
52
+ - They're sorted by the date they were converted so it's easier to track breaking changes. If you're just starting off, I highly recommend the latest, which is currently 2023-05-25; combined with KoboldCpp v1.25.1+, it improves the tokenizer, which in my testing reduces occurrences of broken words like "Alicae" or "Reimu Hai-ku-rei". (If you're not sure which format a file is, see the header-inspection sketch after this list.)
53
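If you've lost track of which conversion a given `.bin` came from, peeking at its header can help. The sketch below assumes the layout written by ggerganov/ggml's gpt-neox conversion script at the commits linked above (a 4-byte `ggml` magic followed by little-endian int32 hyperparameters, with the quantization version folded into `ftype`); treat it as a rough aid, not a format specification.

```python
# Rough sketch, not a format spec: read the header of a converted gpt-neox GGML
# file. The field order and the ftype/quantization-version split are assumptions
# based on ggerganov/ggml's gpt-neox example; they may not hold for every commit.
import struct
import sys

def peek_header(path: str) -> dict:
    with open(path, "rb") as f:
        (magic,) = struct.unpack("<I", f.read(4))
        if magic != 0x67676D6C:  # b"lmgg" on disk, i.e. "ggml" read little-endian
            raise ValueError(f"unexpected magic 0x{magic:08x}")
        names = ("n_vocab", "n_ctx", "n_embd", "n_head", "n_layer", "n_rot", "par_res", "ftype")
        fields = dict(zip(names, struct.unpack("<8i", f.read(32))))
        # assumed: quantization version is folded into ftype in multiples of 1000
        fields["qnt_version"], fields["ftype"] = divmod(fields["ftype"], 1000)
        return fields

if __name__ == "__main__":
    print(peek_header(sys.argv[1]))
```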
 
54
  # ALTERNATIVES
55
+ If you're here because you want a smaller model to run on a device with constrained memory, consider the following, most (if not all) of which have GGML conversions available:
56
+ - [**RedPajama-INCITE**](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1) (3B, 7B), using the GPT-NeoX architecture
57
+ - [**OpenLLaMA**](https://huggingface.co/openlm-research/open_llama_3b_600bt_preview) (3B, 7B), using the LLaMA architecture
58
+ - [**MPT-1b-RedPajama-200b**](https://huggingface.co/mosaicml/mpt-1b-redpajama-200b) (1B), using the MPT architecture
59
+ - [**RWKV-4 PilePlus**](https://huggingface.co/BlinkDL/rwkv-4-pileplus) (169M, 430M, 1.5B, 3B), using the RWKV architecture
60
+ - [**GPT-2**](https://huggingface.co/gpt2-xl) (124M, 355M, 774M, 1.5B), using the GPT-2 architecture