---
license: gpl-3.0
tags:
- text2text-generation
pipeline_tag: text2text-generation
language:
- zh
- en
---
Considering LLaMA's license constraints, the model is for research and learning only.
Please strictly respect LLaMA's usage policy. We are not allowed to publish weights for LLaMA, of course, even finetuned, but there is no problem publishing the difference, a patch that we suggest applying to the original files.
The encryption is a simple XOR between files, ensuring that only people who have access to the original weights (from completely legal sources, of course) can transform them into finetuned weights.
You can find the decryption code at https://github.com/LianjiaTech/BELLE/tree/main/models .

# Model Card for Model ID

## Welcome
If you find this model helpful, please *like* this model and star us on https://github.com/LianjiaTech/BELLE !

## Model description
We release our model described in the paper
[Towards Better Instruction Following Language Models for Chinese](https://github.com/LianjiaTech/BELLE/blob/main/docs/Towards%20Better%20Instruction%20Following%20Language%20Models%20for%20Chinese.pdf)

Among the instruction-following models we compared, this model achieves the best performance, with a score of 0.762 on our evaluation set.

![Experimental results](main_results.png)

## Download, Convert & Check
1. After you git clone this model, check the MD5 checksums of the encrypted files:
```
md5sum ./*
29db882bdab3131ef05943ee8ba82e2c  ./config.json.6375ff434583e14cfc1fd45f9f599ddb9c689cb9b8c542d427dc6d5dc1059037.enc
f9b33d359f17a437f6c24b4de6f2272e  ./generation_config.json.fd7ff399e5568cc21a0a8414f43df88ef7c424995b9b97a90563165d2cf79efd.enc
794e28fff16ef8c3fe9e48e3aa6ccf3a  ./pytorch_model-00001-of-00002.bin.b552ebc4dd499812cfe1e45ffcaad0ee93851ef83df95eb4f824be53b25e5531.enc
1ab136a4489016c3004e3f04c438f268  ./pytorch_model-00002-of-00002.bin.45adb5c7b91f81b2c03c913f2e52487a0e22663e088063b699c6a903101b7968.enc
0d6db7f247a51589f3dd6d08dbfe64ce  ./pytorch_model.bin.index.json.4f08b269e18619675bc3fd62f6efb3a8d59f9d54fa50f5625d0bba7adabaf90e.enc
34696bfce7b27548cfc2410e2b55762e  ./special_tokens_map.json.96bdbb8504d9967606e5f661ccc7cbbac44a3661af863a7a58614670a0ccab33.enc
6014cf2235521f974c8d9fb69b6cf07e  ./tokenizer_config.json.7078cc180b3d35e7ccd06b49ede4a7fef85f2572bda40c1fe2fc8f9ab25418d3.enc
56724a79091f3d1877cca65c6412d646  ./tokenizer.model.0b716a618c9e7c45648f91d997431eba3b0ff111b17ce7b777280ed771a49f95.enc
```

2. Decrypt the files using the scripts in https://github.com/LianjiaTech/BELLE/tree/main/models

You can use the following command in Bash.
Please replace "/path/to_encrypted" with the path where you stored the encrypted files,
replace "/path/to_original_llama_7B" with the path where you stored the original LLaMA 7B weights,
and replace "/path/to_finetuned_model" with the path where you want to save the final finetuned model.

```bash
mkdir -p /path/to_finetuned_model
# Decrypt every encrypted file against the original LLaMA 7B checkpoint
for f in "/path/to_encrypted"/*; do
    if [ -f "$f" ]; then
        python3 decrypt.py "$f" "/path/to_original_llama_7B/consolidated.00.pth" "/path/to_finetuned_model/"
    fi
done
```

After executing the command above, you will obtain the following files.
```
./config.json
./generation_config.json
./pytorch_model-00001-of-00002.bin
./pytorch_model-00002-of-00002.bin
./pytorch_model.bin.index.json
./special_tokens_map.json
./tokenizer_config.json
./tokenizer.model
```

3. Check md5sum

You can verify that the files were recovered completely by comparing their MD5 checksums.
Here are the MD5 checksums for the relevant files:
```
md5sum ./*
139cb9dc0065bd878b277860c70add74  ./config.json
2917a1cafb895cf57e746cfd7696bfe5  ./generation_config.json
2f6cce3296b6bfeb8beb1629bf07dfe9  ./pytorch_model-00001-of-00002.bin
8fe5b4ad70788b3a6086ef28709a8730  ./pytorch_model-00002-of-00002.bin
e5385004e4876ea6b93d6126e845a82f  ./pytorch_model.bin.index.json
15f7a943faa91a794f38dd81a212cb01  ./special_tokens_map.json
08f6f621dba90b2a23c6f9f7af974621  ./tokenizer_config.json
6ffe559392973a92ea28032add2a8494  ./tokenizer.model
```

## Use model
Please note that the input should be formatted as follows in both **training** and **inference**.
```
Human: {input} \n\nAssistant:
```

To load BELLE-LLAMA-7B-2M-enc with Hugging Face transformers, please install the main branch, as the latest stable version doesn't support LLaMA (as of March 26, 2023).
```bash
pip install git+https://github.com/huggingface/transformers
```

After you decrypt the files, BELLE-LLAMA-7B-2M can be easily loaded with LlamaForCausalLM.
``` python
from transformers import LlamaForCausalLM, AutoTokenizer
import torch

# Directory that holds the decrypted files
ckpt = '/path/to_finetuned_model/'
device = torch.device('cuda')
model = LlamaForCausalLM.from_pretrained(ckpt, device_map='auto', low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# The prompt follows the "Human: {input} \n\nAssistant: " format.
# The instruction means "Write a Chinese song praising nature".
prompt = "Human: 写一首中文歌曲,赞美大自然 \n\nAssistant: "
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generate_ids = model.generate(input_ids, max_new_tokens=300, do_sample=True, top_k=30, top_p=0.85,
                              temperature=0.5, repetition_penalty=1.2,
                              eos_token_id=2, bos_token_id=1, pad_token_id=0)
output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
# Strip the prompt so that only the generated answer remains
response = output[len(prompt):]
print(response)
```

## Limitations
A few issues remain in the model trained on the current base model and data:

1. The model might generate factual errors when asked to follow instructions related to facts.

2. The model occasionally generates harmful responses, since it still struggles to identify potentially harmful instructions.

3. The model needs improvement in reasoning and coding.

Since the model still has these limitations, we require that developers use the open-sourced code, data, model and any other artifacts generated via this project for research purposes only. Commercial use and other potentially harmful use cases are not allowed.

## Citation

Please cite our paper and GitHub repository when using our code, data or model.
```
@misc{ji2023better,
      title={Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation},
      author={Yunjie Ji and Yan Gong and Yong Deng and Yiping Peng and Qiang Niu and Baochang Ma and Xiangang Li},
      year={2023},
      eprint={2304.07854},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{BELLE,
  author = {Yunjie Ji and Yong Deng and Yan Gong and Yiping Peng and Qiang Niu and Baochang Ma and Xiangang Li},
  title = {BELLE: Be Everyone's Large Language model Engine},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/LianjiaTech/BELLE}},
}
```
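
## Appendix: checking the decrypted files in Python

The MD5 check in step 3 of "Download, Convert & Check" can also be scripted. The following is a minimal sketch, not part of the BELLE repository: the function names `md5_of_file` and `find_mismatches` are hypothetical, the digests are copied from step 3, and `/path/to_finetuned_model` is the same placeholder used in the decryption command.

```python
import hashlib
from pathlib import Path

# Expected MD5 digests of the decrypted files, copied from step 3 of this card.
EXPECTED_MD5 = {
    "config.json": "139cb9dc0065bd878b277860c70add74",
    "generation_config.json": "2917a1cafb895cf57e746cfd7696bfe5",
    "pytorch_model-00001-of-00002.bin": "2f6cce3296b6bfeb8beb1629bf07dfe9",
    "pytorch_model-00002-of-00002.bin": "8fe5b4ad70788b3a6086ef28709a8730",
    "pytorch_model.bin.index.json": "e5385004e4876ea6b93d6126e845a82f",
    "special_tokens_map.json": "15f7a943faa91a794f38dd81a212cb01",
    "tokenizer_config.json": "08f6f621dba90b2a23c6f9f7af974621",
    "tokenizer.model": "6ffe559392973a92ea28032add2a8494",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(model_dir, expected=EXPECTED_MD5):
    """Return the names of files that are missing or whose digest differs."""
    model_dir = Path(model_dir)
    bad = []
    for name, want in expected.items():
        path = model_dir / name
        if not path.is_file() or md5_of_file(path) != want:
            bad.append(name)
    return bad


if __name__ == "__main__":
    # Placeholder path; replace it with your own output directory.
    problems = find_mismatches("/path/to_finetuned_model")
    if problems:
        print(f"mismatch or missing: {problems}")
    else:
        print("all files match")
```

Reading in chunks keeps memory use low for the multi-gigabyte weight shards. An empty list from `find_mismatches` means every file listed in step 3 is present and has the expected digest.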