The model is based on [`rinna/bilingual-gpt-neox-4b-instruction-sft`](https://huggingface.co/rinna/bilingual-gpt-neox-4b-instruction-sft).
* The first SFT stage produces [`rinna/bilingual-gpt-neox-4b-instruction-sft`](https://huggingface.co/rinna/bilingual-gpt-neox-4b-instruction-sft).
* The second RL stage produces this model.
* **Reinforcement learning**

  We used [CarperAI/trlx](https://github.com/CarperAI/trlx) and its implementation of the PPO algorithm for the RL stage.

  The RL data is a subset of the following dataset, translated into Japanese.
  * [Anthropic HH RLHF data](https://huggingface.co/datasets/Anthropic/hh-rlhf)
* **Model Series**

  | Variant | Link |
  | :-- | :-- |
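For background, the clipped surrogate objective at the heart of PPO can be illustrated in a few lines of plain Python. This is a schematic of the algorithm only, not trlx's actual implementation:

```python
import math

def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-token clipped surrogate term of the PPO objective.

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    Clipping the ratio to [1 - eps, 1 + eps] keeps each policy update
    close to the old policy, which stabilizes RLHF fine-tuning.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # PPO maximizes the minimum of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)

# A large ratio with a positive advantage is clipped to (1 + eps) * advantage.
print(ppo_clip_term(logp_new=0.0, logp_old=-1.0, advantage=1.0))
```

With `logp_new - logp_old = 1`, the raw ratio is e ≈ 2.72, so the clipped term (1.2 × advantage) wins the minimum, capping how much that sample can move the policy.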
---

# Benchmarking

Our evaluation experiments suggest that PPO does not particularly improve the model's performance on the Japanese LLM benchmark in comparison with [Bilingual GPT-NeoX 4B SFT](https://huggingface.co/rinna/bilingual-gpt-neox-4b-instruction-sft), but we have observed a **better conversation experience** with the PPO model than with its SFT counterpart.

- *The 4-task average accuracy is based on results of JCommonsenseQA, JNLI, MARC-ja, and JSQuAD.*
- *The 6-task average accuracy is based on results of JCommonsenseQA, JNLI, MARC-ja, JSQuAD, XWinograd, and JAQKET-v2.*

| Model | 4-task average accuracy | 6-task average accuracy |
| :-- | :-- | :-- |
| **bilingual-gpt-neox-4b-instruction-ppo** | **61.01** | **61.16** |
| bilingual-gpt-neox-4b-instruction-sft | 61.02 | 61.69 |
| bilingual-gpt-neox-4b | 56.12 | 51.83 |
| japanese-gpt-neox-3.6b-instruction-ppo | 59.86 | 60.07 |
| japanese-gpt-neox-3.6b | 55.07 | 50.32 |

---
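The averages in the table above are plain means over the listed tasks. A sketch, using hypothetical per-task accuracies since the table reports only the averages:

```python
def average_accuracy(task_scores):
    """Mean accuracy (%) over a dict of per-task benchmark accuracies."""
    return sum(task_scores.values()) / len(task_scores)

# Hypothetical per-task numbers for illustration only; the real
# per-task results behind the table are not reproduced here.
scores = {"JCommonsenseQA": 60.0, "JNLI": 58.0, "MARC-ja": 68.0, "JSQuAD": 58.0}
print(average_accuracy(scores))  # 61.0
```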
# I/O Format

A special format has been adopted to construct inputs.
* An input prompt is formatted as a conversation between `ユーザー` (user) and `システム` (system).
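A minimal sketch of assembling a prompt in this conversation format. The newline separator and the trailing open `システム: ` turn are assumptions of this sketch; consult the model card's full examples for the exact convention:

```python
def build_prompt(turns):
    """Render (speaker, text) pairs as a ユーザー/システム conversation.

    Assumptions of this sketch: turns are joined with a newline, and the
    prompt ends with an open 'システム: ' turn so the model generates
    the system's reply.
    """
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    lines.append("システム: ")  # leave the system turn open for generation
    return "\n".join(lines)

prompt = build_prompt([("ユーザー", "世界で一番高い山は?")])
print(prompt)
```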