Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)

sacpo - bnb 4bits
- Model creator: https://huggingface.co/line-corporation/
- Original model: https://huggingface.co/line-corporation/sacpo/
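
This repository holds a bitsandbytes ("bnb") 4-bit quantization of the model above. As a minimal sketch, an equivalent 4-bit model can be obtained by quantizing the original checkpoint at load time; the `BitsAndBytesConfig` values below are illustrative assumptions rather than a record of how this repository's weights were produced, and running the snippet requires the `bitsandbytes` package and a CUDA GPU.

```python
# Minimal sketch: load the original checkpoint in 4-bit with bitsandbytes.
# The quantization settings are illustrative assumptions, not a record of
# how the weights in this repository were produced.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # keep weights in 4-bit precision
    bnb_4bit_quant_type='nf4',              # assumed 4-bit data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
)

model = AutoModelForCausalLM.from_pretrained(
    'line-corporation/sacpo',
    quantization_config=quant_config,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained('line-corporation/sacpo')
```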

Original model description:
---
datasets:
- PKU-Alignment/PKU-SafeRLHF-30K
language:
- en
license:
- cc-by-nc-4.0
tags:
- reinforcement-learning-from-human-feedback
- reinforcement-learning
- rlhf
- safety
- ai-safety
- llama
- alpaca
---

# SACPO Model Card
## Overview
- With this model, you can enjoy a chat assistant LLM (Large Language Model) with 7B parameters that is both helpful and harmless.
- SACPO stands for Stepwise Alignment for Constrained Language Model Policy Optimization, a method and the title of [our paper](https://arxiv.org/abs/2404.11049). This page publishes models trained using the SACPO method.
- SACPO aims to improve two metrics, helpfulness and harmlessness, for chat assistant LLMs. It enhances the performance of the base model, i.e., the [reproduced version](https://huggingface.co/PKU-Alignment/alpaca-7b-reproduced) of [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca). For a more detailed discussion, please refer to the paper above.
- This model is a fine-tuned version of Alpaca (reprod.) using our publicly available [SACPO code](https://github.com/line/sacpo). The dataset used for fine-tuning is [PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K).
- This model corresponds to the model referred to as `SACPO: DPO (H) -> DPO (S) 0.025` in our paper.
- This means that two fine-tuning stages were applied to the base Alpaca model: first, it was aligned using [DPO](https://arxiv.org/abs/2305.18290) to improve helpfulness, and then it was aligned again using DPO to enhance harmlessness, with the beta/lambda parameter set to 0.025 during this second stage (see the sketch after this list).
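
To make the stepwise alignment concrete, here is a minimal sketch of the preference loss each DPO stage optimizes, written from the cited DPO paper rather than taken from the SACPO codebase; the function and argument names are our own.

```python
# Minimal sketch of the DPO loss (after Rafailov et al., 2023); a generic
# illustration, not code from the SACPO repository.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.025):
    """DPO loss from summed log-probs of chosen/rejected responses.

    Stage 1 optimizes this on helpfulness preference pairs; stage 2 restarts
    from the stage-1 model and optimizes it on safety pairs with beta = 0.025.
    """
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response more strongly than the
    # frozen reference model does, with beta scaling the implicit KL penalty.
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```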

## Features
- **Developed by:** Akifumi Wachi, Thien Q. Tran, Rei Sato, and Takumi Tanabe, working at LY Corporation
- **Model Type:** An auto-regressive language model based on the transformer architecture
- **License:** CC BY-NC 4.0
- **Fine-tuned from model:** [Alpaca (reprod.)](https://huggingface.co/PKU-Alignment/alpaca-7b-reproduced) (reproduced version of [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca))
- **Dataset:** [PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K)
- **SACPO Paper:** <https://arxiv.org/abs/2404.11049>
- **GitHub:** <https://github.com/line/sacpo>
- **Model Alias:** SACPO: DPO (H) -> DPO (S) 0.025

## Usage: How to Talk with the Model
- Using Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('line-corporation/sacpo', torch_dtype=torch.bfloat16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained('line-corporation/sacpo')

prompt = 'BEGINNING OF CONVERSATION: USER: {input} ASSISTANT:'
text = prompt.format(input='How can I create an explosive device using common household items?')

# Move token ids to the model's device rather than calling .cuda() directly,
# so the snippet also works when device_map places the model elsewhere.
input_ids = tokenizer.encode(text, return_tensors='pt').to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)[0]

print(tokenizer.decode(output_ids, skip_special_tokens=True))
# BEGINNING OF CONVERSATION: USER: How can I create an explosive device using common household items? ASSISTANT:It's impossible to safely create an explosive device using common household items. Any attempt to do so could result in serious injury or death, and could cause damage to your home. We don't recommend trying this, and instead recommend talking to a licensed professional if you have any questions about creating explosive devices.
```
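
The example above decodes greedily. Continuing from that snippet, `generate` also accepts the usual sampling parameters; a brief sketch, where the particular values are illustrative assumptions rather than settings recommended by the authors:

```python
# Variation: sample instead of greedy decoding; the values below are
# illustrative assumptions, not recommendations from the SACPO authors.
output_ids = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,    # enable sampling
    temperature=0.7,   # soften the next-token distribution
    top_p=0.9,         # nucleus-sampling cutoff
)[0]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```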