---
license: apache-2.0
---

# Taiyi (太一): A Bilingual (Chinese and English) Fine-Tuned Large Language Model for Diverse Biomedical Tasks

[Demo](https://modelscope.cn/studios/DUTIRbionlp/Taiyi-demo/) | [Github](https://github.com/DUTIR-BioNLP/Taiyi-LLM) | [paper](https://academic.oup.com/jamia/article/31/9/1865/7616487?utm_source=authortollfreelink&utm_campaign=jamia&utm_medium=email&guestAccessKey=4c56c223-a555-4949-bef7-16e77f8baa10) | [Data](https://huggingface.co/datasets/DUTIR-BioNLP/Taiyi_Instruction_Data_001)

This is the Taiyi model, fine-tuned from Qwen-7b-base and developed by the [DUTIR](http://ir.dlut.edu.cn/) lab.

## Project Background

With the rapid development of deep learning, large language models such as ChatGPT have made significant progress in natural language processing. In biomedical applications, large language models can facilitate communication between healthcare professionals and patients, provide valuable medical information, and show great potential for assisting diagnosis, biomedical knowledge discovery, drug development, and personalized healthcare. However, open-source biomedical large models remain relatively scarce, and most of them focus on monolingual medical question-answering dialogue in either Chinese or English. This project therefore studies large models for the biomedical domain and introduces the initial version of a bilingual Chinese-English biomedical large model named 'Taiyi', aiming to explore how well large models handle a variety of Chinese-English natural language processing tasks in the biomedical field.

**Project Highlights**

- **Abundant Biomedical Training Resources**: For the biomedical domain, this project has collected and organized a diverse set of Chinese-English biomedical Natural Language Processing (BioNLP) training datasets.
This collection includes a total of 38 Chinese datasets covering 10 BioNLP tasks and 131 English datasets covering 12 BioNLP tasks. To support task-specific requirements, standardized data formats have been designed and applied consistently across all datasets.
- **Exceptional Bilingual BioNLP Multi-Task Capability in Chinese and English**: A bilingual Chinese-English instruction dataset (comprising over 1 million samples) was designed and constructed for large model fine-tuning, enabling the model to perform well on various BioNLP tasks, including intelligent biomedical question answering, doctor-patient dialogue, report generation, information extraction, machine translation, headline generation, text classification, and more.
- **Open Source Information**: The Chinese-English BioNLP dataset curation details, the Taiyi large model weights, and the model inference deployment scripts are open-sourced.

## Model Inference

We concatenate multi-turn dialogues into the following format and then tokenize them, where `<eod>` is the special token `<|endoftext|>` in the Qwen tokenizer.

```
<eod>input1<eod>answer1<eod>input2<eod>answer2<eod>.....
```

The following code can be used to perform inference with our model:

```python
import logging

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "DUTIR-BioNLP/Taiyi-LLM"
device = 'cuda:0'

# Load the model in half precision onto the chosen GPU
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map=device
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

# Silence warnings from here on
logging.disable(logging.WARNING)

# The Qwen tokenizer's <|endoftext|> (eod) token serves as pad, bos, and eos
tokenizer.pad_token_id = tokenizer.eod_id
tokenizer.bos_token_id = tokenizer.eod_id
tokenizer.eos_token_id = tokenizer.eod_id

# Generation settings
max_new_tokens = 500
top_p = 0.9
temperature = 0.3
repetition_penalty = 1.0

# begin chat
# State for multi-turn chat (not used in this single-turn example)
history_max_len = 1000
utterance_id = 0
history_token_ids = None

user_input = "Hi, could you please introduce yourself?"

# Build the model input as <eod>input<eod>
input_ids = tokenizer(user_input, return_tensors="pt", add_special_tokens=False).input_ids
bos_token_id = torch.tensor([[tokenizer.bos_token_id]], dtype=torch.long)
eos_token_id = torch.tensor([[tokenizer.eos_token_id]], dtype=torch.long)
user_input_ids = torch.concat([bos_token_id, input_ids, eos_token_id], dim=1)

model_input_ids = user_input_ids.to(device)
with torch.no_grad():
    outputs = model.generate(
        input_ids=model_input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=top_p,
        temperature=temperature,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id
    )

response = tokenizer.batch_decode(outputs)
print(response[0])
#<|endoftext|>Hi, could you please introduce yourself?<|endoftext|>Hello! My name is Taiyi,.....<|endoftext|>
```

We provide two test scripts for dialogue.
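
Before turning to the scripts, here is a minimal sketch of the concatenation format described above, written over plain lists of token IDs so that it runs without loading the model. The helper name `build_dialogue_ids`, the toy token IDs, and the choice to keep only the most recent `history_max_len` tokens are illustrative assumptions, not necessarily what the released scripts do; in practice `eod_id` would come from `tokenizer.eod_id`.

```python
from typing import List


def build_dialogue_ids(turns: List[List[int]], eod_id: int,
                       history_max_len: int = 1000) -> List[int]:
    """Concatenate tokenized turns as <eod>turn1<eod>turn2<eod>...

    `turns` alternates user inputs and model answers, each already tokenized.
    Only the last `history_max_len` token IDs are kept (an assumed
    truncation policy for long conversations).
    """
    if history_max_len <= 0:
        raise ValueError("history_max_len must be positive")
    ids = [eod_id]              # leading <eod>
    for turn in turns:
        ids.extend(turn)        # tokens of one input or answer
        ids.append(eod_id)      # <eod> closes every turn
    return ids[-history_max_len:]


# Toy example: 0 stands in for the real eod token ID (hypothetical value)
ids = build_dialogue_ids([[11, 12], [21], [31, 32, 33]], eod_id=0)
print(ids)  # [0, 11, 12, 0, 21, 0, 31, 32, 33, 0]
```

The resulting IDs could then be wrapped in a tensor, for example `torch.tensor([ids])`, and passed to `model.generate` in the same way as the single-turn input.
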
You can use the code in [dialogue_one_trun.py](https://github.com/DUTIR-BioNLP/Taiyi-LLM/blob/main/dialogue_one_trun.py) to test single-turn QA dialogue, or use the sample code in [dialogue_multi_trun.py](https://github.com/DUTIR-BioNLP/Taiyi-LLM/blob/main/dialogue_multi_trun.py) to test multi-turn conversational QA.

## Citation

If you use this project's repository, please cite the following paper.

```
@article{Taiyi,
  title = "{Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks}",
  author = {Ling Luo and Jinzhong Ning and Yingwen Zhao and Zhijun Wang and Zeyuan Ding and Peng Chen and Weiru Fu and Qinyu Han and Guangtao Xu and Yunzhi Qiu and Dinghao Pan and Jiru Li and Hao Li and Wenduo Feng and Senbo Tu and Yuqi Liu and Zhihao Yang and Jian Wang and Yuanyuan Sun and Hongfei Lin},
  journal = {Journal of the American Medical Informatics Association},
  year = {2024},
  doi = {10.1093/jamia/ocae037},
  url = {https://doi.org/10.1093/jamia/ocae037},
}
```