---
language:
- eng
license:
- mit
tags:
- llama-2
- sft
datasets:
- LDJnr/Capybara
- LDJnr/LessWrong-Amplify-Instruct
- LDJnr/Pure-Dove
- LDJnr/Verified-Camel
model-index:
- name: Nous-Capybara-7B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 55.29
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=NousResearch/Nous-Capybara-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 80.73
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=NousResearch/Nous-Capybara-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 48.72
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=NousResearch/Nous-Capybara-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 51.13
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=NousResearch/Nous-Capybara-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 73.32
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=NousResearch/Nous-Capybara-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 6.97
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=NousResearch/Nous-Capybara-7B
      name: Open LLM Leaderboard
---

## **Nous-Capybara-7B V1**

**A MUCH BETTER MISTRAL-BASED VERSION IS OUT NOW AS CAPYBARA V1.9**

The Capybara series is made by fine-tuning on data created by Nous with our novel data synthesis technique called Amplify-Instruct. The seed distribution and synthesis method combine top-performing existing data synthesis techniques and distributions used for SOTA models such as Airoboros, Evol-Instruct, Orca, Vicuna, Know_Logic, Lamini, FLASK and others, all into one lean, holistically formed dataset and model. The seed instructions used to start the synthesized conversations are largely based on highly regarded datasets like Airoboros, Know logic, EverythingLM and GPTeacher, along with entirely new seed instructions derived from posts on the website LessWrong, and are supplemented with certain in-house multi-turn datasets like Dove (a successor to Puffin).

While the model performs well in its current state, the dataset used for fine-tuning is entirely contained within 20K training examples, mostly comprised of newly synthesized conversation tokens that, to our knowledge, have never previously been used for AI training.

This small fine-tuning dataset has significant implications for how we'll be able to scale model abilities in the future! The dataset currently holds 20K examples, yet the model matches the benchmarks of notable models trained on 300K-example datasets that are more than 10 times the size!

## Process of creation and special thank yous!

This model was fine-tuned by Nous Research, with LDJ leading the training and dataset curation, along with significant dataset formation contributions by J-Supha. Thank you also to Emozilla for helping to expedite the training experimentation process.

Special thank you to **A16Z** for sponsoring our training, as well as **Yield Protocol** for their support in resources during R&D of aspects outside of training, such as dataset development/synthesis.

## Thank you to those of you that have indirectly contributed!

While most of the tokens within Capybara are newly synthesized and part of datasets like Puffin/Dove, we would like to credit the single-turn datasets we leveraged as seeds to generate the multi-turn data as part of the Amplify-Instruct synthesis.

The datasets shown in green below are datasets that we sampled from to curate seeds that are used during Amplify-Instruct synthesis for this project.

![Capybara](https://i.imgur.com/yB58OoD.jpeg)

## Model Training

Nous-Capybara 7B is a new model trained for multiple epochs on a dataset of roughly 20,000 carefully curated conversational examples, most of which consist of entirely new in-house synthesized tokens that previously didn't exist on HuggingFace.

Additional data came from manually curated CamelAI data, with the help of volunteers ranging from former physics PhDs to mathematicians, biologists and more!

## Prompt Format

The recommended prompt format is:

```
USER:

ASSISTANT:
```
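As a minimal sketch of how a multi-turn conversation could be assembled into this format, the helper below joins earlier turns and leaves the final `ASSISTANT:` tag open for the model to complete. The function name `build_prompt`, the single-newline separator, and the placement of each message on the same line as its tag are assumptions for illustration; the card only specifies the `USER:` and `ASSISTANT:` tags.

```python
from typing import List, Tuple


def build_prompt(turns: List[Tuple[str, str]], next_user_message: str) -> str:
    """Format prior (user, assistant) turns plus a new user message.

    Hypothetical helper: the separator and spacing are assumptions,
    only the USER:/ASSISTANT: tags come from the model card.
    """
    parts = []
    for user_msg, assistant_msg in turns:
        parts.append(f"USER: {user_msg}")
        parts.append(f"ASSISTANT: {assistant_msg}")
    parts.append(f"USER: {next_user_message}")
    # Leave the last ASSISTANT: tag open so the model generates the reply.
    parts.append("ASSISTANT:")
    return "\n".join(parts)


history = [("What is a capybara?", "A large South American rodent.")]
print(build_prompt(history, "How much does one weigh?"))
```

The resulting string can be passed to whichever inference stack you use; if outputs look off, adjusting the whitespace between turns is the first thing to try.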

## Notable Features:

 - The first Nous model trained on over 10,000 multi-turn conversations.

 - Over 1,000 tokens on average per conversation example, with multiple back-and-forth turns per conversation! Most models are still trained only on single-turn conversations with fewer than 300 tokens per example!

 - Able to effectively do complex summaries of advanced topics and studies.

 - Ability to recall information up to late 2022 without internet access.

 - Includes a portion of conversational data synthesized from LessWrong posts, with in-depth discussion of the nature of rationality, reasoning, self-improvement and related concepts.

## Example Outputs!:

![Capybara](https://img001.prntscr.com/file/img001/T9yYxR1xQSaK_UGdy3t2Cw.png)

![Capybara](https://img001.prntscr.com/file/img001/DQXqmKbsQQOIcgny1eoGNA.png)

![Capybara](https://img001.prntscr.com/file/img001/85X3L9ZxTsOKo3fUQ7GRVA.png)

## Benchmarks!

Note that all of the benchmarks mentioned are single-turn and don't test multi-turn capabilities. Capybara should excel even further at multi-turn conversational tasks than these benchmark comparisons show.

![Capybara](https://i.imgur.com/n8lkmyK.png)
 

## Future Changes

This is a relatively early build amongst the grand plans for the future of Capybara! 

[IT IS NOW RECOMMENDED TO USE CAPYBARA V1.9 FOR SIGNIFICANTLY BETTER OVERALL CAPABILITIES]

## Future model sizes

We plan on releasing a 3B, 13B and 70B version, as well as a potential 1B version based on phi-1.5 or similar architectures.

## How you can help!

In the near future we plan on leveraging the help of domain-specific expert volunteers to eliminate any mathematically or verifiably incorrect answers from our training curations.

If you have at least a bachelor's degree in mathematics, physics, biology or chemistry and would like to volunteer even just 30 minutes of your expertise, please contact LDJ on Discord!

## Dataset contamination.

We checked for 100%, 99%, 98% and 97% similarity matches between our data and many popular benchmarks, and found no matches!

We checked for contamination against the following benchmarks:

- HumanEval

- AGIEval

- TruthfulQA

- MMLU

- GPT4All
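The card does not say which similarity metric was used for these checks. As a rough sketch only, the snippet below counts training samples whose best match against a benchmark reaches each of the four thresholds, using the standard-library `difflib.SequenceMatcher` ratio as a stand-in metric. The function names, the lowercase/whitespace normalization, and the choice of metric are all assumptions, not the authors' actual procedure.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable, Sequence

# Thresholds taken from the card: 100%, 99%, 98% and 97% similarity.
THRESHOLDS = (1.00, 0.99, 0.98, 0.97)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def count_matches(
    train_samples: Iterable[str],
    benchmark_samples: Iterable[str],
    thresholds: Sequence[float] = THRESHOLDS,
) -> Dict[float, int]:
    """Count training samples whose best benchmark similarity meets each threshold."""
    counts = {t: 0 for t in thresholds}
    bench = [normalize(b) for b in benchmark_samples]
    for sample in train_samples:
        s = normalize(sample)
        # Best similarity of this sample against any benchmark item.
        best = max(
            (SequenceMatcher(None, s, b, autojunk=False).ratio() for b in bench),
            default=0.0,
        )
        for t in thresholds:
            if best >= t:
                counts[t] += 1
    return counts


train = ["What is 2+2?", "Tell me about capybaras."]
bench = ["what is  2+2?", "Name the capital of France."]
print(count_matches(train, bench))
```

This pairwise comparison is quadratic in the number of samples, so a real check over a 20K-example dataset and several benchmarks would likely need hashing or n-gram indexing first; the sketch only illustrates the thresholding idea.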

```
@article{daniele2023amplify-instruct,
  title={Amplify-Instruct: Synthetically Generated Diverse Multi-turn Conversations for Efficient LLM Training.},
  author={Daniele, Luigi and Suphavadeeprasit},
  journal={arXiv preprint arXiv:(coming soon)},
  year={2023}
}
```
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_NousResearch__Nous-Capybara-7B)

|             Metric              |Value|
|---------------------------------|----:|
|Avg.                             |52.70|
|AI2 Reasoning Challenge (25-Shot)|55.29|
|HellaSwag (10-Shot)              |80.73|
|MMLU (5-Shot)                    |48.72|
|TruthfulQA (0-shot)              |51.13|
|Winogrande (5-shot)              |73.32|
|GSM8k (5-shot)                   | 6.97|