SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning

📃 [SciGLM] [GitHub]

SciGLM is a suite of scientific language models able to conduct college-level scientific reasoning. Central to our approach is a novel self-reflective instruction annotation framework to address the data scarcity challenge in the science domain. This framework leverages existing LLMs to generate step-by-step reasoning for unlabelled scientific questions, followed by a process of self-reflective critic-and-revise. Applying this framework, we curated SciInstruct, a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs.
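
The generate, critique, and revise flow described above can be sketched as a small loop. This is only an illustrative sketch, not the authors' pipeline: the function `annotate`, its three callables (`generate`, `critique`, `revise`), and the `max_rounds` parameter are hypothetical names introduced here, and in practice each callable would wrap a prompt to an existing LLM.

```python
from typing import Callable, Optional


def annotate(
    question: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Optional[str]],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 2,
) -> str:
    """Produce a step-by-step solution, then refine it by critique-and-revise.

    All three callables are hypothetical stand-ins for LLM calls:
      generate(question)                   -> draft solution
      critique(question, solution)         -> feedback text, or None if acceptable
      revise(question, solution, feedback) -> improved solution
    """
    solution = generate(question)
    for _ in range(max_rounds):
        feedback = critique(question, solution)
        if feedback is None:  # the critic found nothing to fix
            break
        solution = revise(question, solution, feedback)
    return solution
```

Bounding the loop with `max_rounds` is a design choice of this sketch: it keeps the number of LLM calls per question predictable even when the critic never accepts a solution.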

SciInstruct

SciInstruct is composed as follows:

Subject    Math      Physics & Chemistry    Formal Proofs (Lean)    Total
Number     89,934    123,869                40,248                  254,051

We release our data and model for public use. If you wish to use SciInstruct or SciGLM, you can download them from the following links.

Download data: [Google Drive] [Tsinghua Cloud]

Download model: [Hugging Face]

Training & Inference

Fine-tuning

You can use the SciGLM model through Hugging Face's Transformers library. First, clone the repository and install its dependencies:

git clone https://github.com/THUDM/SciGLM.git
cd SciGLM
pip install -r requirements.txt

To train the 6B model, run:

bash /path/training/finetune.sh

Inference

cd /path/to/inference
python cli_demo.py
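
For use outside the CLI demo, a minimal sketch of loading the model directly with Transformers might look like the following. It rests on several assumptions: the repo id `zd21/SciGLM-6B` is taken from the model page, `trust_remote_code=True` is used because the repository ships custom model code, and the `chat(tokenizer, query, history=...)` call assumes a ChatGLM-style interface in that custom code. The helper names `load_sciglm` and `ask` are hypothetical.

```python
MODEL_ID = "zd21/SciGLM-6B"  # assumed repo id; adjust if your copy lives elsewhere


def load_sciglm(model_id: str = MODEL_ID):
    """Load the tokenizer and model from the Hugging Face Hub (or a local path)."""
    # Imported lazily so that `ask` can be used or tested without Transformers installed.
    from transformers import AutoModel, AutoTokenizer

    # trust_remote_code=True lets Transformers run the custom code in the model repo.
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
    return tokenizer, model.eval()  # eval() switches off dropout for inference


def ask(model, tokenizer, question: str) -> str:
    """Send one question and return the model's reply.

    ASSUMPTION: the model exposes a ChatGLM-style
    chat(tokenizer, query, history=...) -> (response, history) method.
    """
    response, _history = model.chat(tokenizer, question, history=[])
    return response


# Example usage (downloads the weights on first run):
#   tokenizer, model = load_sciglm()
#   print(ask(model, tokenizer, "State Newton's second law and give one example."))
```

Loading a 6B model this way places it on the CPU in full precision; moving it to a GPU or using half precision is left out of the sketch.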

Citation

If you find our work helpful, please cite our paper:

@article{zhang2024sciglm,
  title={{SciGLM}: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning},
  author={Zhang, Dan and Hu, Ziniu and Zhoubian, Sining and Du, Zhengxiao and Yang, Kaiyu and Wang, Zihan and Yue, Yisong and Dong, Yuxiao and Tang, Jie},
  journal={arXiv preprint arXiv:2401.07950},
  year={2024}
}