---
language:
- en
pipeline_tag: text-classification
---

# Llama-3.1-Bespoke-MiniCheck-7B

This is a fact-checking model developed by [Bespoke Labs](https://bespokelabs.ai) and maintained by [Liyan Tang](https://www.tangliyan.com/) and Bespoke Labs. The model is an improvement of the MiniCheck model proposed in the following paper:

📃 [**MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents**](https://arxiv.org/pdf/2404.10774.pdf)

[GitHub Repo](https://github.com/Liyan06/MiniCheck)

The model takes as input a document and a sentence and determines whether the sentence is supported by the document:

**MiniCheck-Model(document, claim) -> {0, 1}**

To fact-check a multi-sentence claim, the claim should first be broken up into sentences (a short sketch is given below). The document does not need to be chunked unless it exceeds `32K` tokens.

`Llama-3.1-Bespoke-MiniCheck-7B` is finetuned from `internlm/internlm2_5-7b-chat` ([Cai et al., 2024](https://arxiv.org/pdf/2403.17297)) on a combination of only 35K data points:

- 21K ANLI examples ([Nie et al., 2020](https://aclanthology.org/2020.acl-main.441.pdf))
- 14K synthetically generated examples following the scheme in the MiniCheck paper, but with additional proprietary data curation techniques (sampling, selecting additional high-quality data sources, etc.) from Bespoke Labs. Specifically, we generate 7K "claim-to-document" (C2D) and 7K "doc-to-claim" (D2C) examples.

The following steps were taken to avoid benchmark contamination: the model's error types on the benchmark data were not used, and the data sources were curated independently of the benchmark. All synthetic data is generated by [`meta-llama/Meta-Llama-3.1-405B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct), hence the name `Llama-3.1-Bespoke-MiniCheck-7B`.

**While scaling up the model (compared to what is in MiniCheck) helped, many improvements come from high-quality curation, thus establishing the superiority of Bespoke Labs's curation technology.**

### Model Variants

We also have three other MiniCheck model variants:
- [lytang/MiniCheck-Flan-T5-Large](https://huggingface.co/lytang/MiniCheck-Flan-T5-Large) (Model Size: 0.8B)
- [lytang/MiniCheck-RoBERTa-Large](https://huggingface.co/lytang/MiniCheck-RoBERTa-Large) (Model Size: 0.4B)
- [lytang/MiniCheck-DeBERTa-v3-Large](https://huggingface.co/lytang/MiniCheck-DeBERTa-v3-Large) (Model Size: 0.4B)
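As noted above, a multi-sentence claim should be decomposed into individual sentences before scoring. Below is a minimal, unofficial sketch of one way to do this; it assumes NLTK's `sent_tokenize` for sentence splitting (any splitter works) and the `MiniCheck` wrapper shown under *Model Usage* below. How per-sentence labels are aggregated (here, requiring every sentence to be supported) is left to the application.

```python
# Unofficial sketch: decompose a multi-sentence claim into sentences and
# fact-check each sentence against the same document.
import os
from nltk.tokenize import sent_tokenize  # may require nltk.download("punkt") on first use
from minicheck.minicheck import MiniCheck

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

doc = "A group of students gather in the school library to study for their upcoming final exams."
claim = "The students are preparing for an examination. They are meeting in the library."

sentences = sent_tokenize(claim)

scorer = MiniCheck(model_name='Bespoke-MiniCheck-7B', enable_prefix_caching=False, cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=[doc] * len(sentences), claims=sentences)

# Treat the full claim as supported only if every sentence is supported.
claim_supported = all(label == 1 for label in pred_label)
print(pred_label, claim_supported)
```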
### Model Performance

The performance of these models is evaluated on our newly collected benchmark (unseen by our models during training), [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact), which aggregates 11 recent human-annotated datasets on fact-checking and grounding LLM generations.

**Llama-3.1-Bespoke-MiniCheck-7B is the SOTA fact-checking model, despite its small size.**

# Model Usage

Please first clone our [GitHub Repo](https://github.com/Liyan06/MiniCheck) and install the necessary packages from `requirements.txt`.

### Below is a simple use case

```python
import os
from minicheck.minicheck import MiniCheck

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

doc = "A group of students gather in the school library to study for their upcoming final exams."
claim_1 = "The students are preparing for an examination."
claim_2 = "The students are on vacation."

# model_name can be one of:
# ['roberta-large', 'deberta-v3-large', 'flan-t5-large', 'Bespoke-MiniCheck-7B']
scorer = MiniCheck(model_name='Bespoke-MiniCheck-7B', enable_prefix_caching=False, cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])

print(pred_label) # [1, 0]
print(raw_prob)   # [0.9840446675150499, 0.010986349594852094]
```

### Throughput

We speed up Llama-3.1-Bespoke-MiniCheck-7B inference with [vLLM](https://github.com/vllm-project/vllm). Based on our test on a single A6000 (48GB VRAM), both Llama-3.1-Bespoke-MiniCheck-7B (with vLLM) and MiniCheck-Flan-T5-Large achieve a throughput of over 500 docs/min.

### Automatic Prefix Caching

> Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV
> cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.

To enable automatic prefix caching for `Bespoke-MiniCheck-7B`, simply set `enable_prefix_caching=True` when initializing the MiniCheck model (no other changes are needed):

```python
scorer = MiniCheck(model_name='Bespoke-MiniCheck-7B', enable_prefix_caching=True, cache_dir='./ckpts')
```

How automatic prefix caching affects throughput and model performance can be found in the [GitHub Repo](https://github.com/Liyan06/MiniCheck).

### Test on our [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact) Benchmark

```python
import os

import pandas as pd
from datasets import load_dataset
from minicheck.minicheck import MiniCheck

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# load the 29K-example test split
df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
docs = df.doc.values
claims = df.claim.values

scorer = MiniCheck(model_name='Bespoke-MiniCheck-7B', enable_prefix_caching=False, cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims)  # ~500 docs/min, depending on hardware
```

To evaluate the results on the benchmark:

```python
from sklearn.metrics import balanced_accuracy_score

df['preds'] = pred_label
result_df = pd.DataFrame(columns=['Dataset', 'BAcc'])

for dataset in df.dataset.unique():
    sub_df = df[df.dataset == dataset]
    bacc = balanced_accuracy_score(sub_df.label, sub_df.preds) * 100
    result_df.loc[len(result_df)] = [dataset, bacc]

result_df.loc[len(result_df)] = ['Average', result_df.BAcc.mean()]
result_df.round(1)
```
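For reference, balanced accuracy as used above is the unweighted mean of the recall on the supported (label 1) and unsupported (label 0) classes, so a majority-class predictor scores only 50%. A minimal sketch of the metric for binary labels (equivalent to scikit-learn's `balanced_accuracy_score` in this setting; the example numbers are illustrative only):

```python
# Minimal reference implementation of balanced accuracy for binary {0, 1} labels;
# equivalent to sklearn.metrics.balanced_accuracy_score in this setting.
def balanced_accuracy(labels, preds):
    recalls = []
    for cls in (0, 1):
        total = sum(1 for l in labels if l == cls)
        correct = sum(1 for l, p in zip(labels, preds) if l == cls and p == cls)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)

# Example: class-1 recall is 0.5, class-0 recall is 1.0 -> balanced accuracy 0.75
print(balanced_accuracy([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.75
```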
# License

Free for non-commercial use. For commercial licensing, please contact company@bespokelabs.ai.

# Citation

```
@misc{tang2024minicheck,
    title={MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents},
    author={Liyan Tang and Philippe Laban and Greg Durrett},
    year={2024},
    eprint={2404.10774},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```