|
---
license: apache-2.0
datasets:
- Anthropic/hh-rlhf
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- rlhf
- alignment
- simulation
- computational social science
---
|
|
|
|
|
# Model Card for HH-RLHF-SFT |
|
|
|
![model image](https://agwarbliu.s3.amazonaws.com/logo.png) |
|
|
|
![model image](https://agwarbliu.s3.amazonaws.com/model_select_sft.png) |
|
|
|
|
|
**An Efficient, Effective, and Stable alternative to RLHF!**
|
|
|
**Instead of training an additional reward model that is likely to be gamed, we train the model directly on data from simulated social games!** 🕹️ 🎲 🎮
|
|
|
Full details on simulation and training can be found [here](https://github.com/agi-templar/Stable-Alignment). |
|
|
|
# Training Procedure |
|
|
|
This model is the second step of the Stable Alignment project: a model supervised fine-tuned on the [Anthropic HH-RLHF dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf), using only the 'chosen' responses.
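
As a rough illustration (not the authors' exact preprocessing), the SFT training text can be derived from the dataset by keeping only the `chosen` side of each preference pair:

```python
# Minimal sketch of the data selection, assuming the standard Hugging Face
# `datasets` API; the authors' actual preprocessing may differ.
from datasets import load_dataset

raw = load_dataset("Anthropic/hh-rlhf", split="train")

# Each example stores a full multi-turn dialogue in 'chosen' and 'rejected';
# supervised fine-tuning uses only the preferred ('chosen') dialogue.
sft_data = raw.map(
    lambda ex: {"text": ex["chosen"]},
    remove_columns=["chosen", "rejected"],
)
print(sft_data[0]["text"][:200])
```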
|
|
|
We use the [Alpaca fine-tuning script](https://github.com/tatsu-lab/stanford_alpaca) to train this model. |
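
Once trained, the checkpoint can be loaded with the standard `transformers` text-generation API. The sketch below is illustrative only: the repository ID is a placeholder, and the exact prompt template should follow the Stable Alignment repository.

```python
# Hedged usage sketch with the standard transformers API.
# "path/to/hh-rlhf-sft" is a placeholder, not the confirmed repository ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/hh-rlhf-sft"  # placeholder; replace with the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Prompt format is an assumption (HH-RLHF-style dialogue); check the repo
# for the template actually used during fine-tuning.
prompt = "\n\nHuman: How can I be more patient with my coworkers?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```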
|
|
|
|
|
# Bias, Risks, and Limitations |
|
|
|
Although this project aims to better align current LMs with social norms, inappropriate content and inherent biases in the training data may still impair the model's alignment.
|
|
|
The model should not be used directly in any application without a prior assessment of the safety and fairness concerns specific to that application.
|
|
|
# Citation |
|
|
|
Please cite our paper if you use the data or code in this repo: |
|
|
|
```bibtex
@misc{liu2023sociallyaligned,
      title={Training Socially Aligned Language Models in Simulated Human Society},
      author={Ruibo Liu and Ruixin Yang and Chenyan Jia and Ge Zhang and Denny Zhou and Andrew M. Dai and Diyi Yang and Soroush Vosoughi},
      year={2023},
      eprint={2305.16960},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```