munish0838 committed
Commit f1a9528
1 parent: 4cfde10

Update README.md

Files changed (1):
  1. README.md +47 -1
README.md CHANGED
@@ -10,4 +10,50 @@ tags:
 - RLHF
 - conversational
 - reward model
----
+---
+
+---
+license: apache-2.0
+datasets:
+- berkeley-nest/Nectar
+language:
+- en
+library_name: transformers
+tags:
+- reward model
+- RLHF
+- RLAIF
+---
+# Starling-LM-7B-beta-GGUF
+
+- Model creator: [Nexusflow](https://huggingface.co/Nexusflow)
+- Original model: [Starling-LM-7B-beta](https://huggingface.co/Nexusflow/Starling-LM-7B-beta)
+
+<!-- description start -->
+## Description
+
+This repo contains GGUF-format model files for [Starling-LM-7B-beta](https://huggingface.co/Nexusflow/Starling-LM-7B-beta).
+
+**Model Summary**
+<!-- Provide a quick summary of what the model is/does. -->
+
+- **Developed by:** The Nexusflow Team (Banghua Zhu*, Evan Frick*, Tianhao Wu*, Hanlin Zhu, Karthik Ganesan, Wei-Lin Chiang, Jian Zhang, and Jiantao Jiao).
+- **Model type:** Language model fine-tuned with RLHF / RLAIF
+- **License:** Apache-2.0 license, under the condition that the model is not used to compete with OpenAI
+- **Fine-tuned from model:** [Openchat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106) (based on [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1))
+
+
+We introduce Starling-LM-7B-beta, an open large language model (LLM) trained by Reinforcement Learning from AI Feedback (RLAIF). Starling-LM-7B-beta is trained from [Openchat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106) with our new reward model [Nexusflow/Starling-RM-34B](https://huggingface.co/Nexusflow/Starling-RM-34B) and the policy optimization method from [Fine-Tuning Language Models from Human Preferences (PPO)](https://arxiv.org/abs/1909.08593).
+Harnessing the ranking dataset [berkeley-nest/Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar), the upgraded reward model [Starling-RM-34B](https://huggingface.co/Nexusflow/Starling-RM-34B), and the new reward-training and policy-tuning pipeline, Starling-LM-7B-beta scores an improved 8.12 on MT-Bench with GPT-4 as judge.
+
+
+## Citation
+```
+@misc{starling2023,
+    title = {Starling-7B: Improving LLM Helpfulness & Harmlessness with RLAIF},
+    url = {},
+    author = {Zhu, Banghua and Frick, Evan and Wu, Tianhao and Zhu, Hanlin and Ganesan, Karthik and Chiang, Wei-Lin and Zhang, Jian and Jiao, Jiantao},
+    month = {November},
+    year = {2023}
+}
+```
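
The GGUF files this commit documents can be fetched with `huggingface_hub`. A minimal download sketch, assuming the repo id `QuantFactory/Starling-LM-7B-beta-GGUF` and the quant filename `Starling-LM-7B-beta.Q4_K_M.gguf` (both are assumptions; check the repo's file list for the actual names):

```python
# Minimal sketch: fetch one GGUF quant from the Hub.
# repo_id and filename are assumptions; substitute the real entries
# from this repository's "Files and versions" tab.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="QuantFactory/Starling-LM-7B-beta-GGUF",  # hypothetical repo id
    filename="Starling-LM-7B-beta.Q4_K_M.gguf",       # hypothetical quant name
)
print(path)  # local cache path to the downloaded .gguf file
```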
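
Once downloaded, a GGUF file can be run with llama.cpp bindings. A sketch using `llama-cpp-python`, with the OpenChat-3.5-style prompt format that Starling presumably inherits from its base model (an assumption here; verify the template against the original model card before relying on it):

```python
# Minimal sketch: run a Starling GGUF quant with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="Starling-LM-7B-beta.Q4_K_M.gguf",  # hypothetical quant name
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # offload all layers to GPU; set 0 for CPU-only
)

# OpenChat-3.5-style prompt (assumed; Starling is fine-tuned from Openchat-3.5-0106).
prompt = "GPT4 Correct User: What is RLAIF?<|end_of_turn|>GPT4 Correct Assistant:"
out = llm(prompt, max_tokens=256, stop=["<|end_of_turn|>"])
print(out["choices"][0]["text"])
```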