Initial GGML model commit
README.md (changed)
@@ -166,13 +166,13 @@ Thank you to all my generous patrons and donaters!

We have used our own [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca) to fine-tune Llama2-13B using [OpenChat](https://huggingface.co/openchat) packing and conditional behavior cloning.
This dataset is our attempt to reproduce the dataset generated for Microsoft Research's [Orca Paper](https://arxiv.org/abs/2306.02707).

This second preview release is trained on a curated, filtered subset of most of our GPT-4 augmented data.

This release highlights that our dataset and training methods have surpassed performance parity with the Orca paper.
We measured this with BigBench-Hard and AGIEval results, using the same methods as the Orca paper, and find **~103%** of the original Orca's performance on average.
As well, this is done with <1/10th the compute requirement and <20% of the dataset size used in the original Orca paper.

We have run extensive evaluations internally and expect this model to **place number 1** on both the HuggingFaceH4 Open LLM Leaderboard and the GPT4ALL Leaderboard for 13B models.

"One" of [OpenChat](https://huggingface.co/openchat) has joined our team, and we'd like to extend special thanks for their training of this model!
We have utilized OpenChat conditional behavior cloning and the [MultiPack algorithm](https://github.com/imoneoi/multipack_sampler), which achieves 99.85% bin-packing efficiency on our dataset.
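
For intuition on what MultiPack-style packing does, below is a rough, illustrative sketch of greedy length-based bin packing of tokenized examples into fixed-size context windows. It is a simplified stand-in, not the actual MultiPack sampler linked above; the function name, the 4096-token context length, and the example lengths are all assumptions for illustration.

```python
# Illustrative only: a greedy first-fit-decreasing packer that groups tokenized
# examples into fixed-size context windows. The real MultiPack sampler (linked
# above) is more sophisticated and aware of distributed training.
def pack_examples(lengths, context_len=4096):
    bins = []   # each bin holds indices of examples packed into one context window
    free = []   # remaining token budget of each bin
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        for b, space in enumerate(free):
            if lengths[i] <= space:      # first bin with enough room
                bins[b].append(i)
                free[b] -= lengths[i]
                break
        else:                            # no bin fits: open a new one
            bins.append([i])
            free.append(context_len - lengths[i])
    return bins

# Packing efficiency = tokens actually used / tokens available across all bins.
lengths = [3000, 1100, 2048, 1900, 512, 4000]   # hypothetical example lengths
bins = pack_examples(lengths)
print(bins, f"{sum(lengths) / (len(bins) * 4096):.2%}")
```
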

@@ -201,46 +201,58 @@ We have evaluated **OpenOrcaxOpenChat-Preview2-13B** on hard reasoning tasks fro

Our average performance for BigBench-Hard: 0.488

Average for AGIEval: 0.447

In the Orca paper, they measured their score relative to Vicuna on these evals.
We have done the same and find that our scores average **~103%** of the total performance shown in the Orca paper, using the same evaluation methods as outlined there.
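
To make that comparison concrete, the arithmetic looks roughly like the sketch below. The Vicuna and Orca scores used here are hypothetical placeholders for illustration only, not numbers from this card or the paper.

```python
# Illustrative arithmetic for the "relative to Vicuna" comparison described above.
# The baseline values below are hypothetical placeholders, NOT real scores.
our_score    = 0.488   # our BigBench-Hard average (from this card)
vicuna_score = 0.300   # hypothetical Vicuna-13B score on the same eval
orca_score   = 0.470   # hypothetical Orca score on the same eval

our_relative  = our_score / vicuna_score    # our score as a multiple of Vicuna
orca_relative = orca_score / vicuna_score   # Orca's score as a multiple of Vicuna

# "~103% of Orca's performance" corresponds to a ratio like this being ~1.03
# (with a shared Vicuna baseline this reduces to our_score / orca_score).
print(our_relative / orca_relative)
```
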

So we are surpassing Orca performance with <20% of the dataset size and <1/10th the training budget!

As well, we have evaluated using the methodology and tools for the HuggingFace Leaderboard and GPT4ALL Leaderboard, and find that we place #1 on both for all 13B models at release time!

## AGIEval Performance

We present our results in two columns.
The column for "`(Orca Paper eval)`" uses the methods outlined in the Orca paper, so as to be a direct apples-to-apples comparison with the results from the paper.
The column for "`(HF Leaderboard eval)`" uses EleutherAI's LM Evaluation Harness with settings outlined by HuggingFace. These results are not comparable to the other columns, as the methods are different.

![OpenOrca Preview2 AGIEval Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2AGIEval.png "AGIEval Performance")

## BigBench-Hard Performance

We present our results in two columns.
The column for "`(Orca Paper eval)`" uses the methods outlined in the Orca paper, so as to be a direct apples-to-apples comparison with the results from the paper.
The column for "`(HF Leaderboard eval)`" uses EleutherAI's LM Evaluation Harness with settings outlined by HuggingFace. These results are not comparable to the other columns, as the methods are different.

![OpenOrca Preview2 BigBench-Hard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2BigBenchHardEval.png "BigBench-Hard Performance")

## HuggingFaceH4 Open LLM Leaderboard Performance

We have run our own tests using parameters matching the [HuggingFaceH4 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) evals.

We place #1 for all 13B models at release time!

![OpenOrca Preview2 HuggingFace Leaderboard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2HuggingFaceLeaderboard.png "HuggingFace Leaderboard Performance")
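
For reproducibility, the sketch below shows how an evaluation of this kind can be launched with EleutherAI's lm-evaluation-harness Python API (v0.4-style). The task selection, few-shot count, and model arguments shown are assumptions mirroring the public leaderboard setup, not settings taken from this card.

```python
# Rough sketch: scoring the original (non-GGML) model with EleutherAI's
# lm-evaluation-harness. Task name and few-shot count are assumptions that
# mirror the Open LLM Leaderboard's ARC setting; adjust for the other tasks.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Open-Orca/OpenOrcaxOpenChat-Preview2-13B,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=25,
    batch_size=8,
)
print(results["results"]["arc_challenge"])
```
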

## GPT4ALL Leaderboard Performance

We have tested using parameters matching the GPT4ALL Benchmark Suite and report our results and placement vs. their official reporting below.

We place #1 for all open models and are comparable to `text-davinci-003`, a proprietary OpenAI model an order of magnitude larger.

![OpenOrca Preview2 GPT4ALL Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2GPT4ALL_Leaderboard.png "GPT4ALL Performance")

# Dataset

We used a curated, filtered selection of most of the GPT-4 augmented data from our OpenOrca dataset, which aims to reproduce the Orca Research Paper dataset.

Further details of our curation practices will be forthcoming with our full model releases.

# Training

We trained with 8x A100-80G GPUs for 46 hours, completing 5 epochs of full fine-tuning on our dataset in one training run.
This contrasts with the 20x A100-80G GPUs for 200 hours used in the Orca paper, for only 3 epochs, and requiring stacked training (which is known to suffer from catastrophic forgetting).
Our compute requirement was <1/10th that of the original Orca.
Commodity cost was ~$600.
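
As a rough sanity check of the compute comparison (counting GPU-hours only, ignoring epoch count and other differences):

```python
# Quick check of the "<1/10th the compute" claim using the numbers above.
ours = 8 * 46      # 8x A100-80G for 46 hours  -> 368 GPU-hours
orca = 20 * 200    # 20x A100-80G for 200 hours -> 4000 GPU-hours
print(ours, orca, ours / orca)  # 368 4000 0.092  (just under 1/10th)
```
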

@@ -263,6 +275,18 @@ tokenize("User: Hello<|end_of_turn|>Assistant: Hi<|end_of_turn|>User: How are yo

```
# Result: [1, 4911, 29901, 15043, 32000, 4007, 22137, 29901, 6324, 32000, 4911, 29901, 1128, 526, 366, 9826, 29973, 32000, 4007, 22137, 29901]
```

For UIs with Prefix and Suffix fields, these will likely work:

Prefix (include a space after the colon):
```
User: 
```

Suffix (include a space after the colon):
```
<|end_of_turn|>\nAssistant: 
```
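
As an illustration of the turn format shown above, here is a minimal sketch that assembles a multi-turn prompt programmatically and tokenizes it with the original (non-GGML) model's tokenizer from the Hugging Face Hub. The `build_prompt` helper and the example turns are hypothetical, for illustration only.

```python
# Hypothetical helper: assemble an OpenChat-style multi-turn prompt.
# With the original model's tokenizer, <|end_of_turn|> should map to id 32000
# (see the "# Result" example above) -- treat that as an assumption to verify.
from transformers import AutoTokenizer

def build_prompt(turns):
    """turns: list of (role, text) pairs, where role is 'User' or 'Assistant'."""
    prompt = "".join(f"{role}: {text}<|end_of_turn|>" for role, text in turns)
    return prompt + "Assistant:"   # leave the assistant turn open for generation

tok = AutoTokenizer.from_pretrained("Open-Orca/OpenOrcaxOpenChat-Preview2-13B")
prompt = build_prompt([("User", "What is the OpenOrca dataset?")])
print(prompt)
print(tok(prompt).input_ids)
```
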

# Serving