---
language:
- en
pipeline_tag: text-classification
---

<p align="center">
    <img src="./Bespoke-Labs-Logo.png" width="550">
</p>

# Llama-3.1-Bespoke-MiniCheck-7B
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1s-5TYnGV3kGFMLp798r5N-FXPD8lt2dm?usp=sharing)

This is a fact-checking model developed by [Bespoke Labs](https://bespokelabs.ai) and maintained by [Liyan Tang](https://www.tangliyan.com/) and Bespoke Labs. 
The model is an improvement of the MiniCheck model proposed in the following paper:

📃 [**MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents**](https://arxiv.org/pdf/2404.10774.pdf), EMNLP 2024

[GitHub Repo](https://github.com/Liyan06/MiniCheck)

The model takes as input a document and **a sentence** and determines whether the sentence is supported by the document: **MiniCheck-Model(document, claim) -> {0, 1}**

**In order to fact-check a multi-sentence claim, the claim should first be broken up into sentences.** The document does not need to be chunked unless it exceeds `32K` tokens. Depending on use cases, adjusting chunk size may yield better performance.
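As a concrete sketch of splitting a claim before scoring: the example below uses NLTK's `sent_tokenize` (just one choice of splitter, not a requirement), and assumes `scorer` and `doc` are defined as in the usage example under Model Usage below.

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # one-time download of the sentence tokenizer data

multi_sentence_claim = "The students are in the library. They are preparing for an examination."
sentences = sent_tokenize(multi_sentence_claim)

# score each sentence against the same document;
# `scorer` and `doc` are as defined in the usage example under "Model Usage" below
pred_labels, raw_probs, _, _ = scorer.score(docs=[doc] * len(sentences), claims=sentences)
```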

`Llama-3.1-Bespoke-MiniCheck-7B` is finetuned from `internlm/internlm2_5-7b-chat` ([Cai et al., 2024](https://arxiv.org/pdf/2403.17297))
on a combination of only 35K data points:
- 21K ANLI examples ([Nie et al., 2020](https://aclanthology.org/2020.acl-main.441.pdf))
- 14K synthetically-generated examples following the scheme in the MiniCheck paper, augmented with proprietary data curation techniques from Bespoke Labs (sampling, selecting additional high-quality data sources, etc.). Specifically, we generate 7K "claim-to-document" (C2D) and 7K "doc-to-claim" (D2C) examples. To avoid benchmark contamination, the model's error types on the benchmark data were not used, and the data sources were curated independently of the benchmark.

All synthetic data is generated by [`meta-llama/Meta-Llama-3.1-405B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct), thus the name `Llama-3.1-Bespoke-MiniCheck-7B`.

**While scaling up the model (compared to the models in the MiniCheck paper) helped, much of the improvement comes from high-quality data curation, demonstrating the strength of Bespoke Labs' curation technology.**

### Model Variants
We also provide three other MiniCheck model variants:
- [lytang/MiniCheck-Flan-T5-Large](https://huggingface.co/lytang/MiniCheck-Flan-T5-Large) (Model Size: 0.8B)
- [lytang/MiniCheck-RoBERTa-Large](https://huggingface.co/lytang/MiniCheck-RoBERTa-Large) (Model Size: 0.4B)
- [lytang/MiniCheck-DeBERTa-v3-Large](https://huggingface.co/lytang/MiniCheck-DeBERTa-v3-Large) (Model Size: 0.4B)
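
All variants share the `MiniCheck` interface used in the examples under Model Usage below; switching variants only requires changing `model_name`. A minimal sketch:

```python
from minicheck.minicheck import MiniCheck

# smaller, faster variant; score() is called the same way as for Bespoke-MiniCheck-7B
scorer = MiniCheck(model_name='flan-t5-large', cache_dir='./ckpts')
```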


### Model Performance

<p align="center">
    <img src="./performance.png" width="550">
</p>


The performance of these models is evaluated on our newly collected benchmark (unseen by our models during training), [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact),
which aggregates 11 recent human-annotated datasets on fact-checking and grounding LLM generations. **Llama-3.1-Bespoke-MiniCheck-7B is the SOTA
fact-checking model despite its small size.**


# Model Usage

Please run the following command to install the **MiniCheck package** and all necessary dependencies.
```sh
pip install "minicheck[llm] @ git+https://github.com/Liyan06/MiniCheck.git@main"
```

### A simple use case

```python
from minicheck.minicheck import MiniCheck
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

doc = "A group of students gather in the school library to study for their upcoming final exams."
claim_1 = "The students are preparing for an examination."
claim_2 = "The students are on vacation."

# model_name can be one of:
# ['roberta-large', 'deberta-v3-large', 'flan-t5-large', 'Bespoke-MiniCheck-7B']
scorer = MiniCheck(model_name='Bespoke-MiniCheck-7B', enable_prefix_caching=False, cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])  # optionally set `chunk_size=your-specified-value` here; defaults to a 32K chunk size

print(pred_label) # [1, 0]
print(raw_prob)   # [0.9840446675150499, 0.010986349594852094]
```

### Throughput

We speed up Llama-3.1-Bespoke-MiniCheck-7B inference with [vLLM](https://github.com/vllm-project/vllm). In our test on
a single A6000 (48GB VRAM), both Llama-3.1-Bespoke-MiniCheck-7B (with vLLM) and MiniCheck-Flan-T5-Large achieve throughputs above 500 docs/min.
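
To get a rough throughput number on your own hardware, you can time a batch of calls yourself. A minimal sketch, assuming `scorer`, `docs`, and `claims` as defined in the benchmark example below:

```python
import time

start = time.time()
pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims)
elapsed_min = (time.time() - start) / 60

print(f"throughput: {len(docs) / elapsed_min:.0f} docs/min")
```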

### Automatic Prefix Caching

> Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV
> cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.

To enable automatic prefix caching for `Bespoke-MiniCheck-7B`, simply set `enable_prefix_caching=True` when initializing the 
MiniCheck model (no other changes are needed):

```python
scorer = MiniCheck(model_name='Bespoke-MiniCheck-7B', enable_prefix_caching=True, cache_dir='./ckpts')
```

How automatic prefix caching affects the throughput and model performance can be found in the [GitHub Repo](https://github.com/Liyan06/MiniCheck).



### Test on our [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact) Benchmark

```python
import pandas as pd
from datasets import load_dataset
from minicheck.minicheck import MiniCheck
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# load 29K test data
df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
docs = df.doc.values
claims = df.claim.values

scorer = MiniCheck(model_name='Bespoke-MiniCheck-7B', enable_prefix_caching=False, cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims)  # ~500 docs/min, depending on hardware
```

To evaluate the results on the benchmark:
```python
from sklearn.metrics import balanced_accuracy_score

df['preds'] = pred_label
result_df = pd.DataFrame(columns=['Dataset', 'BAcc'])
for dataset in df.dataset.unique():
    sub_df = df[df.dataset == dataset]
    bacc = balanced_accuracy_score(sub_df.label, sub_df.preds) * 100
    result_df.loc[len(result_df)] = [dataset, bacc]

result_df.loc[len(result_df)] = ['Average', result_df.BAcc.mean()]
result_df.round(1)
```

# License

This work is licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
For commercial licensing, please contact company@bespokelabs.ai.

# Citation

```
@InProceedings{tang-etal-2024-minicheck,
  title = {MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents},
  author = {Liyan Tang and Philippe Laban and Greg Durrett},
  booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  year = {2024},
  publisher = {Association for Computational Linguistics},
  url = {https://arxiv.org/pdf/2404.10774}
}

@misc{tang2024bespokeminicheck,
  title={Bespoke-MiniCheck-7B},
  author={Bespoke Labs},
  year={2024},
  url={https://huggingface.co/bespokelabs/Bespoke-MiniCheck-7B},
}
```

# Acknowledgements

Model perfected at [Bespoke Labs](https://www.bespokelabs.ai).

Team:
1. [Liyan Tang](https://tangliyan.com/)
2. [Negin Raoof](https://neginraoof.com/)
3. [Trung Vu](https://x.com/trungthvu)
4. [Greg Durrett](https://www.cs.utexas.edu/~gdurrett/)
5. [Alex Dimakis](https://users.ece.utexas.edu/~dimakis/)
6. [Mahesh Sathiamoorthy](https://smahesh.com)

We also thank Giannis Daras for feedback and Sarthak Malhotra for market research.