Commit f1a9528
Parent(s): 4cfde10
Update README.md

README.md CHANGED
@@ -10,4 +10,50 @@ tags:
 - RLHF
 - conversational
 - reward model
----
+---
+
+---
+license: apache-2.0
+datasets:
+- berkeley-nest/Nectar
+language:
+- en
+library_name: transformers
+tags:
+- reward model
+- RLHF
+- RLAIF
+---
+# Starling-LM-7B-beta-GGUF
+
+- Model creator: [Nexusflow](https://huggingface.co/Nexusflow)
+- Original model: [Starling-LM-7B-beta](https://huggingface.co/Nexusflow/Starling-LM-7B-beta)
+
+<!-- description start -->
+## Description
+
+This repo contains GGUF format model files for [Starling-LM-7B-beta](https://huggingface.co/Nexusflow/Starling-LM-7B-beta).
+
+**Model Summary**
+<!-- Provide a quick summary of what the model is/does. -->
+
+- **Developed by:** The Nexusflow Team (Banghua Zhu\*, Evan Frick\*, Tianhao Wu\*, Hanlin Zhu, Karthik Ganesan, Wei-Lin Chiang, Jian Zhang, and Jiantao Jiao)
+- **Model type:** Language model finetuned with RLHF / RLAIF
+- **License:** Apache-2.0, under the condition that the model is not used to compete with OpenAI
+- **Finetuned from model:** [Openchat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106) (based on [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1))
+
+
+We introduce Starling-LM-7B-beta, an open large language model (LLM) trained by Reinforcement Learning from AI Feedback (RLAIF). Starling-LM-7B-beta is trained from [Openchat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106) with our new reward model [Nexusflow/Starling-RM-34B](https://huggingface.co/Nexusflow/Starling-RM-34B) and the PPO-based policy optimization method of [Fine-Tuning Language Models from Human Preferences](https://arxiv.org/abs/1909.08593).
+Harnessing the power of the ranking dataset [berkeley-nest/Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar), the upgraded reward model [Starling-RM-34B](https://huggingface.co/Nexusflow/Starling-RM-34B), and the new reward-training and policy-tuning pipeline, Starling-LM-7B-beta scores an improved 8.12 on MT-Bench with GPT-4 as a judge.
+
+
+## Citation
+```
+@misc{starling2023,
+    title = {Starling-7B: Improving LLM Helpfulness & Harmlessness with RLAIF},
+    url = {},
+    author = {Zhu, Banghua and Frick, Evan and Wu, Tianhao and Zhu, Hanlin and Ganesan, Karthik and Chiang, Wei-Lin and Zhang, Jian and Jiao, Jiantao},
+    month = {November},
+    year = {2023}
+}
+```
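
The description in the new card says the repo ships GGUF files but gives no usage snippet. Below is a minimal sketch of one common way to run such files, using llama-cpp-python; it is not part of the commit above. The filename `starling-lm-7b-beta.Q4_K_M.gguf` is a hypothetical placeholder for a typical quantization name, and the OpenChat-style prompt format is assumed from the Openchat-3.5-0106 base model rather than stated in this card.

```python
# Minimal sketch, not from the commit above; assumptions flagged in comments.
from llama_cpp import Llama

# Hypothetical filename: check the repo's file list for the actual
# quantization variants before downloading.
llm = Llama(
    model_path="starling-lm-7b-beta.Q4_K_M.gguf",
    n_ctx=8192,  # assumed context length; lower it to fit memory
)

# Prompt format assumed from the Openchat-3.5-0106 base model:
# "GPT4 Correct User: ...<|end_of_turn|>GPT4 Correct Assistant:"
prompt = "GPT4 Correct User: Hello!<|end_of_turn|>GPT4 Correct Assistant:"
out = llm(prompt, max_tokens=256, stop=["<|end_of_turn|>"])
print(out["choices"][0]["text"])
```

Any GGUF-capable runtime (llama.cpp, Ollama, and the like) should work the same way; the prompt template is the part worth double-checking against the original model card.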
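
For the policy optimization step, the card links Ziegler et al. (arXiv:1909.08593). In that paper, PPO maximizes the reward-model score minus a KL penalty that keeps the tuned policy close to the initial policy; a sketch of that objective is below. Whether Starling deviates from this formulation, and what coefficient it uses, is not stated in this card.

```latex
% KL-regularized reward from Ziegler et al. (arXiv:1909.08593), the method
% the card cites. r is the reward model (here Starling-RM-34B), \pi the
% policy being tuned, \rho the initial policy (here Openchat-3.5-0106).
% The value of \beta used for Starling is not given in this card.
R(x, y) = r(x, y) - \beta \log \frac{\pi(y \mid x)}{\rho(y \mid x)}
```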