The model is based on [`rinna/bilingual-gpt-neox-4b-instruction-sft`](https://huggingface.co/rinna/bilingual-gpt-neox-4b-instruction-sft).
* The first SFT stage produces [`rinna/bilingual-gpt-neox-4b-instruction-sft`](https://huggingface.co/rinna/bilingual-gpt-neox-4b-instruction-sft).
* The second RL stage produces this model.
* **Reinforcement learning**

  We used [CarperAI/trlx](https://github.com/CarperAI/trlx) and its implementation of the PPO algorithm for the RL stage.

  The RL data is a subset of the following dataset, translated into Japanese.
  * [Anthropic HH RLHF data](https://huggingface.co/datasets/Anthropic/hh-rlhf)
* **Model Series**

  | Variant | Link |
  | :-- | :-- |
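For background, the clipped surrogate objective at the heart of PPO can be illustrated in a few lines of plain Python. This is a schematic of the algorithm only, not trlx's actual implementation:

```python
import math

def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-token clipped surrogate term of the PPO objective.

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    Clipping the ratio to [1 - eps, 1 + eps] keeps each policy update
    close to the old policy, which stabilizes RLHF fine-tuning.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # PPO maximizes the minimum of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)

# A large ratio with a positive advantage is clipped to (1 + eps) * advantage.
print(ppo_clip_term(logp_new=0.0, logp_old=-1.0, advantage=1.0))
```

With `logp_new - logp_old = 1`, the raw ratio is e ≈ 2.72, so the clipped term (1.2 × advantage) wins the minimum, capping how much that sample can move the policy.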
---

# Benchmarking

Our evaluation experiments suggest that PPO does not particularly improve the model's performance on the Japanese LLM benchmark in comparison with [Bilingual GPT-NeoX 4B SFT](https://huggingface.co/rinna/bilingual-gpt-neox-4b-instruction-sft), but we have observed a **better conversation experience** with the PPO model than with its SFT counterpart.

- *The 4-task average accuracy is based on results of JCommonsenseQA, JNLI, MARC-ja, and JSQuAD.*
- *The 6-task average accuracy is based on results of JCommonsenseQA, JNLI, MARC-ja, JSQuAD, XWinograd, and JAQKET-v2.*

| Model | 4-task average accuracy | 6-task average accuracy |
| :-- | :-- | :-- |
| **bilingual-gpt-neox-4b-instruction-ppo** | **61.01** | **61.16** |
| bilingual-gpt-neox-4b-instruction-sft | 61.02 | 61.69 |
| bilingual-gpt-neox-4b | 56.12 | 51.83 |
| japanese-gpt-neox-3.6b-instruction-ppo | 59.86 | 60.07 |
| japanese-gpt-neox-3.6b | 55.07 | 50.32 |

---
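The averages in the table above are plain means over the listed tasks. A sketch, using hypothetical per-task accuracies since the table reports only the averages:

```python
def average_accuracy(task_scores):
    """Mean accuracy (%) over a dict of per-task benchmark accuracies."""
    return sum(task_scores.values()) / len(task_scores)

# Hypothetical per-task numbers for illustration only; the real
# per-task results behind the table are not reproduced here.
scores = {"JCommonsenseQA": 60.0, "JNLI": 58.0, "MARC-ja": 68.0, "JSQuAD": 58.0}
print(average_accuracy(scores))  # 61.0
```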
# I/O Format

A special format has been adopted to construct inputs.
* An input prompt is formatted as a conversation between `ユーザー` (user) and `システム` (system).
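A minimal sketch of assembling a prompt in this conversation format. The newline separator and the trailing open `システム: ` turn are assumptions of this sketch; consult the model card's full examples for the exact convention:

```python
def build_prompt(turns):
    """Render (speaker, text) pairs as a ユーザー/システム conversation.

    Assumptions of this sketch: turns are joined with a newline, and the
    prompt ends with an open 'システム: ' turn so the model generates
    the system's reply.
    """
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    lines.append("システム: ")  # leave the system turn open for generation
    return "\n".join(lines)

prompt = build_prompt([("ユーザー", "世界で一番高い山は?")])
print(prompt)
```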