---
license: apache-2.0
datasets:
- anon8231489123/ShareGPT_Vicuna_unfiltered
- declare-lab/HarmfulQA
model-index:
- name: starling-7B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 51.02
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=declare-lab/starling-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 76.77
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=declare-lab/starling-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 47.75
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=declare-lab/starling-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 48.18
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=declare-lab/starling-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 70.56
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=declare-lab/starling-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 10.08
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=declare-lab/starling-7B
      name: Open LLM Leaderboard
---
[**Paper**](https://arxiv.org/abs/2308.09662) | [**Github**](https://github.com/declare-lab/red-instruct) | [**Dataset**](https://huggingface.co/datasets/declare-lab/HarmfulQA) | [**Model**](https://huggingface.co/declare-lab/starling-7B)


> 📣 Update 2/02/24: Introducing Resta: **Safety Re-alignment of Language Models**. [**Paper**](https://arxiv.org/abs/2402.11746) [**Github**](https://github.com/declare-lab/resta) [**Dataset**](https://huggingface.co/datasets/declare-lab/CategoricalHarmfulQ)

As part of our research efforts to make LLMs safer, we created **Starling**. It is obtained by fine-tuning Vicuna-7B on [**HarmfulQA**](https://huggingface.co/datasets/declare-lab/HarmfulQA), a ChatGPT-distilled dataset that we collected using the Chain of Utterances (CoU) prompt. More details are in our paper, [**Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment**](https://arxiv.org/abs/2308.09662).

<img src="https://declare-lab.github.io/assets/images/logos/starling-final.png" alt="Image" width="100" height="100">

Experimental results on several safety benchmark datasets indicate that **Starling** is safer than the baseline model, Vicuna.

<img src="https://declare-lab.github.io/assets/images/logos/method.png" alt="Image" width="1000" height="335">
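
Because Starling is a fine-tune of Vicuna-7B, it can presumably be loaded like any other LLaMA-family checkpoint with the `transformers` library. The sketch below is illustrative only: the prompt template is an assumption (the Vicuna v1.1 conversation format of the base model), and `build_prompt` and `generate` are hypothetical helper names, not part of the released code.

```python
# Assumption: Starling inherits the Vicuna v1.1 conversation format from its
# base model; check the Red-Instruct repository for the exact template.
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(question: str) -> str:
    """Wrap a single user question in the assumed Vicuna-style template."""
    return f"{SYSTEM_PROMPT} USER: {question.strip()} ASSISTANT:"


def generate(question: str, max_new_tokens: int = 256) -> str:
    """Load the checkpoint and return the model's reply (downloads the full 7B weights)."""
    # Imported lazily so build_prompt can be used without these packages installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "declare-lab/starling-7B"
    use_cuda = torch.cuda.is_available()
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        # Half precision on GPU; full precision on CPU, where fp16 is poorly supported.
        torch_dtype=torch.float16 if use_cuda else torch.float32,
    )
    model.to("cuda" if use_cuda else "cpu")

    inputs = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated reply is decoded.
    reply_ids = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(reply_ids, skip_special_tokens=True).strip()
```

Calling `generate("How do I boil an egg?")` would download the weights on first use; `build_prompt` can be inspected on its own to verify the template before running the model.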

<h2>Experimental Results</h2>

Compared to Vicuna, Starling shows an **average 5.2% reduction in Attack Success Rate** (ASR) on DangerousQA and HarmfulQA using three different prompts.

Compared to Vicuna, Starling shows an **average 3-7% improvement in HHH score** measured on the BBH-HHH benchmark.
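
To make the ASR figure concrete, here is a minimal sketch of how such a number could be computed. It assumes ASR is the percentage of red-teaming prompts whose response is judged harmful, and that the average reduction is the mean difference in percentage points across evaluation settings; both are assumptions for illustration, and the paper remains the authoritative definition. The function names and sample values are hypothetical.

```python
from statistics import mean


def attack_success_rate(is_harmful: list[bool]) -> float:
    """Percentage of red-teaming prompts whose response was judged harmful."""
    if not is_harmful:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(is_harmful) / len(is_harmful)


def average_asr_reduction(baseline: dict[str, float], aligned: dict[str, float]) -> float:
    """Mean drop in ASR (percentage points) over the settings both models share."""
    shared = baseline.keys() & aligned.keys()
    if not shared:
        raise ValueError("the two models share no evaluation settings")
    return mean(baseline[name] - aligned[name] for name in shared)


# Hypothetical judgments for four prompts: two responses were flagged as harmful.
example_asr = attack_success_rate([True, False, False, True])

# Hypothetical per-setting ASR values (not results from the paper).
example_reduction = average_asr_reduction(
    {"setting_a": 40.0, "setting_b": 30.0},
    {"setting_a": 35.0, "setting_b": 27.0},
)
```

With these made-up inputs, `example_asr` is 50.0 and `example_reduction` is 4.0 percentage points.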

<img src="https://declare-lab.github.io/assets/images/logos/starling-results.png" alt="Image" width="1000" height="335">

TruthfulQA (MC2): **48.90 vs Vicuna's 47.00**

MMLU (5-shot): **46.69 vs Vicuna's 47.18**

BBH (3-shot): **33.47 vs Vicuna's 33.05**

<h2>Jailbreak Prompt for harmfulness eval using Red Eval as reported in the paper</h2>

This jailbreak prompt (termed the Chain of Utterances (CoU) prompt in the paper) shows a 65% Attack Success Rate (ASR) on GPT-4 and 72% on ChatGPT.

<img src="https://declare-lab.github.io/assets/images/logos/jailbreakprompt_main_paper.png" alt="Image" width="1000" height="1000">

<h2>HarmfulQA Data Collection</h2>

We also release our **HarmfulQA** dataset for red-teaming, with 1,960 harmful questions (covering 10 topics with 10 subtopics each), together with the conversations based on them that were used for model safety alignment. More details are available [**here**](https://huggingface.co/datasets/declare-lab/HarmfulQA). The following figure describes the data collection process.

<img src="https://declare-lab.github.io/assets/images/logos/data_gen.png" alt="Image" width="1000" height="1000">

_Note: This model is referred to as Starling (Blue) in the paper. We shall soon release Starling (Blue-Red) which was trained on harmful data using an objective function that helps the model learn from the red (harmful) response data._

## Citation

```bibtex
@misc{bhardwaj2023redteaming,
      title={Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment}, 
      author={Rishabh Bhardwaj and Soujanya Poria},
      year={2023},
      eprint={2308.09662},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_declare-lab__starling-7B).

|             Metric              |Value|
|---------------------------------|----:|
|Avg.                             |50.73|
|AI2 Reasoning Challenge (25-Shot)|51.02|
|HellaSwag (10-Shot)              |76.77|
|MMLU (5-Shot)                    |47.75|
|TruthfulQA (0-shot)              |48.18|
|Winogrande (5-shot)              |70.56|
|GSM8k (5-shot)                   |10.08|
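
As a quick sanity check, the `Avg.` row appears to be the unweighted mean of the six benchmark scores in the table. The snippet below reproduces it under that assumption; the dictionary simply restates the table values.

```python
# Scores copied from the leaderboard table above.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 51.02,
    "HellaSwag (10-Shot)": 76.77,
    "MMLU (5-Shot)": 47.75,
    "TruthfulQA (0-shot)": 48.18,
    "Winogrande (5-shot)": 70.56,
    "GSM8k (5-shot)": 10.08,
}

# Assumption: the leaderboard average is a plain arithmetic mean of the six tasks.
average = sum(scores.values()) / len(scores)
print(f"Avg. {average:.2f}")  # prints "Avg. 50.73"
```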