---
license: bigscience-bloom-rail-1.0
datasets:
- OpenAssistant/oasst1
- RyokoAI/ShareGPT52K
- Dahoas/full-hh-rlhf
- liswei/rm-static-m2m100-zh
- fnlp/moss-002-sft-data
language:
- zh
- en
---
This is an attempt to replicate the RLHF (Reinforcement Learning from Human Feedback) pipeline.
## Base Model

We used bloomz-7b1-mt because of its less restrictive license and its multilingual ability.
## Supervised Finetuning

For SFT we used a combination of multiple datasets, including the following (see the sketch after this list):
- RyokoAI/ShareGPT52K
- GPTeacher
- Alpaca-GPT4 en & zh
- A filtered subset of the ShareGPT dataset machine-translated into Chinese
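As a rough illustration only, such sources can be mixed with the `datasets` library once each one has been converted to a common schema. The file names below are hypothetical placeholders, not the actual preprocessing pipeline:

```python
from datasets import concatenate_datasets, load_dataset

# Hypothetical local JSON files, one per SFT source, already converted to a
# common schema (e.g. a single "text" field holding the formatted conversation).
files = [
    "sharegpt52k.json",
    "gpteacher.json",
    "alpaca_gpt4_en_zh.json",
    "sharegpt_mt_zh_filtered.json",
]
parts = [load_dataset("json", data_files=path, split="train") for path in files]

# Mix all sources into a single shuffled SFT corpus.
sft_data = concatenate_datasets(parts).shuffle(seed=42)
print(len(sft_data), sft_data.column_names)
```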
## Reward Model

For the RM we used the code of the reward-modeling repo and datasets listed in the model card metadata above.
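The usual objective for such a reward model is a pairwise ranking loss over chosen/rejected response pairs. Below is a minimal sketch of that setup, assuming a scalar head on top of the backbone's last-token hidden state; the class and function names are illustrative and not taken from the reward-modeling repo:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """Backbone LM with a scalar scoring head (illustrative sketch)."""

    def __init__(self, base_name: str):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        self.score = nn.Linear(self.backbone.config.hidden_size, 1, bias=False)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Score each sequence by the hidden state of its last non-padding token
        # (assumes right padding).
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0), device=hidden.device), last_idx]
        return self.score(last_hidden).squeeze(-1)

def pairwise_loss(chosen_reward, rejected_reward):
    # Preferred (chosen) responses should be scored above rejected ones.
    return -nn.functional.logsigmoid(chosen_reward - rejected_reward).mean()
```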
## Reinforcement Learning

For RL we used the code of trlx with slight modifications.
Instead of building the value network on top of the policy network with a single linear layer, we add another hydra head on top of the reference network's frozen bottom layers as the value network.
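A minimal sketch of this branching is shown below; the class and argument names are ours, not trlx's. The value branch consumes the activations coming out of the shared frozen bottom layers, runs its own trainable copies of the top decoder blocks, and maps each token's hidden state to a scalar value estimate:

```python
import copy
import torch
import torch.nn as nn

class HydraValueBranch(nn.Module):
    """Value network attached as an extra head on the frozen shared trunk (illustrative sketch)."""

    def __init__(self, decoder_blocks, hidden_size, num_value_blocks):
        super().__init__()
        # Trainable copies of the top-most decoder blocks of the reference model;
        # the bottom blocks stay frozen and are shared with the policy/reference network.
        self.value_blocks = nn.ModuleList(copy.deepcopy(b) for b in decoder_blocks[-num_value_blocks:])
        self.v_head = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden_states, **block_kwargs):
        # `hidden_states` are the activations leaving the frozen bottom layers;
        # `block_kwargs` carries model-specific inputs (attention mask, alibi, ...).
        for block in self.value_blocks:
            hidden_states = block(hidden_states, **block_kwargs)[0]
        return self.v_head(hidden_states).squeeze(-1)  # one value estimate per token
```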
## Example

We used the Vicuna v1.1 template for model training:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "keyfan/bloomz-rlhf"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).cuda()

# Vicuna v1.1 prompt template: system message, then a single USER/ASSISTANT turn.
template = ("A chat between a curious human and an artificial intelligence assistant. "
            "The assistant gives helpful, detailed, and polite answers to the human's questions. "
            "USER: {}\nASSISTANT:")

question = template.format("Who was the president of the United States in 1955?")
inputs = tokenizer.encode(question, return_tensors="pt").cuda()
outputs = model.generate(inputs, do_sample=True, top_p=0.8, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```
## Evaluations

Results on the Chinese BELLE eval set:
| others | rewrite | classification | generation | summarization | extract | open qa | brainstorming | closed qa | macro ave | macro ave w/o others |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.619 | 0.873 | 0.706 | 0.934 | 0.755 | 0.619 | 0.527 | 0.908 | 0.615 | 0.728 | 0.742 |
- We found that in GPT-4 evaluation, the order in which the responses are presented has a non-negligible effect on the final score, even with the carefully designed Vicuna prompt. So we removed the score on the Vicuna eval set.