Text Generation
Transformers
English
llama
TheBloke committed
Commit 3fd30d8 · 1 Parent(s): ec85c05

Initial GGML model commit

Files changed (1)
  1. README.md +42 -18
README.md CHANGED
@@ -166,13 +166,13 @@ Thank you to all my generous patrons and donaters!
166
  We have used our own [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca) to fine-tune Llama2-13B using [OpenChat](https://huggingface.co/openchat) packing and conditional behavior cloning.
167
  This dataset is our attempt to reproduce the dataset generated for Microsoft Research's [Orca Paper](https://arxiv.org/abs/2306.02707).
168
 
169
- This second preview release is trained on a curated filtered subset of most of our GPT4 augmented data.
170
 
171
  This release highlights that our dataset and training methods have surpassed performance parity with the Orca paper.
172
- We measured this with BigBench-Hard and AGIEval results with the same methods as used in the Orca paper, finding ~103% of original Orca's performance on average.
173
- As well, this is done with ~1/10th the compute requirement and using <20% of the dataset size from the original Orca paper.
174
 
175
- We have run extensive evaluations internally and expect this model to place number 1 on both the HuggingFaceH4 Open LLM Leaderboard and the GPT4ALL Leaderboard for 13B models.
176
 
177
  "One" of [OpenChat](https://huggingface.co/openchat) has joined our team, and we'd like to provide special thanks for their training of this model!
178
  We have utilized OpenChat conditional behavior cloning and the [MultiPack algorithm](https://github.com/imoneoi/multipack_sampler), which achieves 99.85% bin-packing efficiency on our dataset.
@@ -201,46 +201,58 @@ We have evaluated **OpenOrcaxOpenChat-Preview2-13B** on hard reasoning tasks fro
201
 
202
  Our average performance for BigBench-Hard: 0.488
203
 
204
- Average for AGIEval: 0.441
205
 
206
  In the Orca paper, they measured their score relative to Vicuna on these evals.
207
- We've done the same and have found our score averages to >103% of the total improvement that was shown in the Orca paper, using the same evaluation methods as outlined in the paper.
208
 
209
- So we are surpassing Orca performance with <20% of the dataset size and ~1/10th the training budget!
210
 
211
- ## BigBench-Hard Performance
212
-
213
- ![OpenOrca Preview2 BigBench-Hard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/OO_Preview2_BigBenchHard.png "BigBench-Hard Performance")
214
 
215
  ## AGIEval Performance
216
 
217
- ![OpenOrca Preview2 AGIEval Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/OO_Preview2_AGIEval.png "AGIEval Performance")
218
 
219
  ## HuggingFaceH4 Open LLM Leaderboard Performance
220
 
221
  We have run our own tests using parameters matching the [HuggingFaceH4 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) evals.
222
- We find
223
 
224
- ![OpenOrca Preview2 HuggingFace Leaderboard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/OO_Preview2_HFLeaderboard.png "GPT4ALL Performance")
225
 
226
  ## GPT4ALL Leaderboard Performance
227
 
228
  We have tested using parameters matching the GPT4ALL Benchmark Suite and report our results and placement vs their official reporting below.
229
- We place #1 for all open models and come within comparison of text-davinci-003, a proprietary model an order of magnitude larger.
230
 
231
- ![OpenOrca Preview2 GPT4ALL Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/OO_Preview2_AGIEval.png "GPT4ALL Performance")
232
 
233
 
234
  # Dataset
235
 
236
  We used a curated, filtered selection of most of the GPT-4 augmented data from our OpenOrca dataset, which aims to reproduce the Orca Research Paper dataset.
237
- Further details of our curation practices will be forthcoming with our full model release.
238
 
239
 
240
  # Training
241
 
242
- We trained with 8x A100-80G GPUs for 46 hours, completing 5 epochs of full fine tuning on our dataset.
243
- This contrasts with the 20x A100-80G GPUs for 200 hours used in the Orca paper, for only 3 epochs.
244
  Our compute requirement was <1/10th that of the original Orca.
245
  Commodity cost was ~$600.
246
 
@@ -263,6 +275,18 @@ tokenize("User: Hello<|end_of_turn|>Assistant: Hi<|end_of_turn|>User: How are yo
263
  # Result: [1, 4911, 29901, 15043, 32000, 4007, 22137, 29901, 6324, 32000, 4911, 29901, 1128, 526, 366, 9826, 29973, 32000, 4007, 22137, 29901]
264
  ```
265

266
 
267
  # Serving
268
 
 
166
  We have used our own [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca) to fine-tune Llama2-13B using [OpenChat](https://huggingface.co/openchat) packing and conditional behavior cloning.
167
  This dataset is our attempt to reproduce the dataset generated for Microsoft Research's [Orca Paper](https://arxiv.org/abs/2306.02707).
168
 
169
+ This second preview release is trained on a curated, filtered subset of most of our GPT-4 augmented data.
170
 
171
  This release highlights that our dataset and training methods have surpassed performance parity with the Orca paper.
172
+ We measured this with BigBench-Hard and AGIEval, using the same methods as the Orca paper, and find **~103%** of the original Orca's performance on average.
173
+ As well, this was achieved with <1/10th the compute requirement and <20% of the dataset size of the original Orca paper.
174
 
175
+ We have run extensive evaluations internally and expect this model to **place number 1** on both the HuggingFaceH4 Open LLM Leaderboard and the GPT4ALL Leaderboard for 13B models.
176
 
177
  "One" of [OpenChat](https://huggingface.co/openchat) has joined our team, and we'd like to provide special thanks for their training of this model!
178
  We have utilized OpenChat conditional behavior cloning and the [MultiPack algorithm](https://github.com/imoneoi/multipack_sampler), which achieves 99.85% bin-packing efficiency on our dataset.
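To make the packing idea concrete, below is a minimal first-fit-decreasing sketch of packing tokenized examples into fixed-length context windows. It illustrates what "bin-packing efficiency" means here; it is not the MultiPack implementation, whose actual algorithm lives in the linked repository.

```
# Illustration only: greedy first-fit-decreasing packing of sequences into
# fixed-size context windows. MultiPack (linked above) uses a more efficient
# scheme; this sketch just shows what "bin-packing efficiency" measures.
def pack_sequences(lengths, max_len=4096):
    bins = []  # each bin holds sequence lengths whose sum stays <= max_len
    for length in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + length <= max_len:
                b.append(length)
                break
        else:
            bins.append([length])
    used = sum(sum(b) for b in bins)
    efficiency = used / (len(bins) * max_len)  # fraction of context actually filled
    return bins, efficiency

bins, eff = pack_sequences([3100, 2048, 900, 700, 512, 400, 256])
print(len(bins), f"{eff:.1%}")  # 2 bins, ~96.6% filled with these lengths
```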
 
201
 
202
  Our average performance for BigBench-Hard: 0.488
203
 
204
+ Average for AGIEval: 0.447
205
 
206
  In the Orca paper, they measured their score relative to Vicuna on these evals.
207
+ We have done the same and find that our scores average **~103%** of the total performance shown in the Orca paper, using the same evaluation methods as outlined in the paper.
208
 
209
+ So we are surpassing Orca performance with <20% of the dataset size and <1/10th the training budget!
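As a rough illustration of how a relative figure like this can be derived when scores are reported against a Vicuna baseline, as in the Orca paper (all numbers below are invented, and the exact normalization is our assumption rather than the authors' formula):

```
# Invented numbers, for illustration only: comparing two models when each is
# first expressed relative to a Vicuna-13B baseline, as in the Orca paper.
vicuna_avg = 0.300   # hypothetical Vicuna-13B average over the eval suite
orca_avg   = 0.450   # hypothetical Orca-13B average
ours_avg   = 0.4635  # hypothetical OpenOrcaxOpenChat average

orca_vs_vicuna = orca_avg / vicuna_avg
ours_vs_vicuna = ours_avg / vicuna_avg
print(f"Orca vs Vicuna: {orca_vs_vicuna:.1%}")                   # 150.0%
print(f"Ours vs Vicuna: {ours_vs_vicuna:.1%}")                   # 154.5%
print(f"Ours vs Orca:   {ours_vs_vicuna / orca_vs_vicuna:.1%}")  # 103.0%
```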
210
 
211
+ We have also evaluated using the methodology and tools of the HuggingFace Leaderboard and the GPT4ALL Leaderboard, and find that we place #1 on both among all 13B models at release time!
 
 
212
 
213
  ## AGIEval Performance
214
 
215
+ We present our results in two columns.
216
+ The column for "`(Orca Paper eval)`" uses the methods outlined in the Orca paper, so as to be a direct apples-to-apples comparison with the results from the paper.
217
+ The column for "`(HF Leaderboard eval)`" uses EleutherAI's LM Evaluation Harness with the settings outlined by HuggingFace. These results are not directly comparable to the Orca paper eval column, as the methods differ.
218
+
219
+ ![OpenOrca Preview2 AGIEval Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2AGIEval.png "AGIEval Performance")
220
+
221
+ ## BigBench-Hard Performance
222
+
223
+ We present our results in two columns.
224
+ The column for "`(Orca Paper eval)`" uses the methods outlined in the Orca paper, so as to be a direct apples-to-apples comparison with the results from the paper.
225
+ The column for "`(HF Leaderboard eval)`" uses EleutherAI's LM Evaluation Harness with the settings outlined by HuggingFace. These results are not directly comparable to the Orca paper eval column, as the methods differ.
226
+
227
+ ![OpenOrca Preview2 BigBench-Hard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2BigBenchHardEval.png "BigBench-Hard Performance")
228
 
229
  ## HuggingFaceH4 Open LLM Leaderboard Performance
230
 
231
  We have run our own tests using parameters matching the [HuggingFaceH4 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) evals.
 
232
 
233
+ We place #1 for all 13B models at release time!
234
+
235
+ ![OpenOrca Preview2 HuggingFace Leaderboard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2HuggingFaceLeaderboard.png "HuggingFace Leaderboard Performance")
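As a rough sketch of how leaderboard-style numbers can be reproduced with EleutherAI's LM Evaluation Harness: the task names, few-shot counts, and harness API below are assumptions about the 2023-era harness and leaderboard configuration, so check the leaderboard's About tab for the exact settings before comparing.

```
# Sketch only: Open LLM Leaderboard-style evals via EleutherAI's
# lm-evaluation-harness. Task names, few-shot counts, and this API are
# assumptions; MMLU is omitted because its many per-subject tasks
# (hendrycksTest-...) are named differently across harness versions.
from lm_eval import evaluator

MODEL_ARGS = "pretrained=Open-Orca/OpenOrcaxOpenChat-Preview2-13B"

TASKS = [
    ("arc_challenge", 25),  # assumed 25-shot, as on the leaderboard
    ("hellaswag", 10),      # assumed 10-shot
    ("truthfulqa_mc", 0),   # assumed 0-shot
]

for task, shots in TASKS:
    out = evaluator.simple_evaluate(
        model="hf-causal",
        model_args=MODEL_ARGS,
        tasks=[task],
        num_fewshot=shots,
        batch_size=2,
    )
    print(task, out["results"][task])
```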
236
 
237
  ## GPT4ALL Leaderboard Performance
238
 
239
  We have tested using parameters matching the GPT4ALL Benchmark Suite and report our results and placement vs their official reporting below.
 
240
 
241
+ We place #1 for all open models and come close to matching `text-davinci-003`, a proprietary OpenAI model an order of magnitude larger.
242
+
243
+ ![OpenOrca Preview2 GPT4ALL Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2GPT4ALL_Leaderboard.png "GPT4ALL Performance")
244
 
245
 
246
  # Dataset
247
 
248
  We used a curated, filtered selection of most of the GPT-4 augmented data from our OpenOrca dataset, which aims to reproduce the Orca Research Paper dataset.
249
+ Further details of our curation practices will be forthcoming with our full model releases.
250
 
251
 
252
  # Training
253
 
254
+ We trained with 8x A100-80G GPUs for 46 hours, completing 5 epochs of full fine-tuning on our dataset in one training run.
255
+ This contrasts with the Orca paper, which used 20x A100-80G GPUs for 200 hours for only 3 epochs and required stacked training (an approach known to suffer from catastrophic forgetting).
256
  Our compute requirement was <1/10th that of the original Orca.
257
  Commodity cost was ~$600.
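As a quick sanity check on these figures (the implied per-GPU-hour rate is an inference from the numbers above, not something stated in the card):

```
# Rough arithmetic check only; the rate is derived, not quoted anywhere.
gpu_hours = 8 * 46              # 368 A100-80G GPU-hours
rate = 600 / gpu_hours          # ~$1.63 per GPU-hour at ~$600 total
print(gpu_hours, round(rate, 2))
```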
258
 
 
275
  # Result: [1, 4911, 29901, 15043, 32000, 4007, 22137, 29901, 6324, 32000, 4911, 29901, 1128, 526, 366, 9826, 29973, 32000, 4007, 22137, 29901]
276
  ```
277
 
278
+ For UIs with Prefix and Suffix fields, these will likely work:
279
+
280
+ Prefix (include a space after colon):
281
+ ```
282
+ User:
283
+ ```
284
+
285
+ Suffix (space after colon):
286
+ ```
287
+ <|end_of_turn|>\nAssistant:
288
+ ```
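For completeness, here is a minimal sketch of building this prompt format in Python and tokenizing it with Hugging Face transformers. The `build_prompt` helper is ours for illustration, and whether `<|end_of_turn|>` maps to id 32000 depends on the tokenizer shipped with the original model repo; compare the output against the token ids shown above.

```
# Sketch: assemble the OpenChat-style prompt shown above and tokenize it.
# build_prompt is a hypothetical helper, not part of the model repo.
from transformers import AutoTokenizer

MODEL_ID = "Open-Orca/OpenOrcaxOpenChat-Preview2-13B"  # original model repo referenced above
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)

def build_prompt(turns):
    """turns: list of (role, text) pairs, e.g. [("User", "Hello")]."""
    prompt = "".join(f"{role}: {text}<|end_of_turn|>" for role, text in turns)
    return prompt + "Assistant:"  # end with the Assistant header so the model replies

prompt = build_prompt([("User", "Hello"), ("Assistant", "Hi"), ("User", "How are you today?")])
print(tokenizer(prompt).input_ids)  # should match the ids listed above if the
                                    # tokenizer treats <|end_of_turn|> as id 32000
```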
289
+
290
 
291
  # Serving
292