---
license: apache-2.0
pipeline_tag: text-generation
---

[**🌐 Homepage**]() | [**🤗 Model**]() | [**📖 arXiv**]() | [**GitHub**]()

# Introduction

CT-LLM-SFT-DPO is an aligned version of [CT-LLM](https://huggingface.co/m-a-p/CT-LLM-Base). Its main features are:

1. The model is an alignment-enhanced variant of CT-LLM-SFT, trained with DPO, a method that learns directly from preference pairs.
2. It is trained on a combination of publicly available datasets and synthetic data.
3. It outperforms a range of 2B LLMs on the CValues benchmark, indicating improved harmlessness.

Alignment training also improves general performance: compared with CT-LLM-SFT, the model scores higher on benchmarks such as COPA, CMMLU, HellaSwag, and TriviaQA.

# Training Data

Our training data blends publicly available datasets with LLM-generated synthetic data.

The open-source Chinese datasets consist of the non-harmful and beneficial portions of [cvalues_rlhf](https://huggingface.co/datasets/Skepsun/cvalues_rlhf), `comparison_gpt4_data_zh` and `oaast_rm_zh` from LLaMA-Factory, [huozi](https://github.com/HIT-SCIR/huozi), and [zhihu](https://huggingface.co/datasets/liyucheng/zhihu_rlhf_3k). The English data includes `comparison_gpt4_data_en` from [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [beavertails](https://github.com/PKU-Alignment/beavertails).

To build a higher-quality preference dataset synthetically, we use alpaca-gpt4, whose "chosen" responses are generated by GPT-4, and [baichuan-6B](https://huggingface.co/baichuan-inc/Baichuan-7B) as a weaker model that generates the "rejected" responses.

In total, the dataset comprises 183k Chinese pairs and 46k English pairs.

# Training Settings

We use CT-LLM-SFT as the reference model $\pi_{sft}$ to optimize the target language model $\pi_{\theta}$.
$\pi_{\theta}$ is initialized from the parameters of $\pi_{sft}$. We use the following hyperparameters:

1. Hardware: 8 H800 GPUs
2. Learning rate: $1 \times 10^{-6}$
3. Batch size: $4$
4. Number of epochs: $2$
5. Weight decay: $0.1$
6. Warmup ratio: $0.03$
7. $\beta = 0.5$, which controls the deviation from $\pi_{sft}$

# Results

## Performance on CValues

![Alt text](safe.png)

## Performance on General Benchmarks

![Alt text](general.png)
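
# DPO Objective (Illustrative Sketch)

To make the role of $\beta$ and the reference model $\pi_{sft}$ in the Training Settings concrete, the following is a minimal sketch of the standard per-pair DPO loss. This card only states that DPO is used with $\beta = 0.5$; the formula is the commonly published DPO objective, and the function names (`dpo_loss`, `log_sigmoid`) and the scalar log-probability inputs are assumptions made for illustration, not the authors' training code.

```python
import math

BETA = 0.5  # value reported in Training Settings


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = BETA,
) -> float:
    """Standard DPO loss for one (chosen, rejected) preference pair.

    Each argument is the summed log-probability of a full response under
    either the trained policy (pi_theta) or the frozen reference (pi_sft).
    Hypothetical helper for illustration only.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the policy favors the chosen response over the
    # rejected one more strongly than the reference does; beta scales
    # how sharply that margin is rewarded.
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# At initialization pi_theta equals pi_sft, so both log-ratios are zero
# and the loss equals log(2).
initial_loss = dpo_loss(-10.0, -12.0, -10.0, -12.0)
```

Because $\pi_{\theta}$ starts as a copy of $\pi_{sft}$, the loss begins at $\log 2$ for every pair and decreases only as the policy shifts probability toward the chosen response relative to the reference.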