|
--- |
|
license: gpl-3.0 |
|
tags: |
|
- text2text-generation |
|
pipeline_tag: text2text-generation |
|
language: |
|
- zh |
|
- en |
|
--- |
|
|
|
Considering LLaMA's license constraints, this model is for research and learning purposes only.

Please strictly respect LLaMA's usage policy. We are not allowed to publish the LLaMA weights, even finetuned ones, but there is no problem publishing the difference: a patch that we suggest applying to the original files.

The encryption is a simple XOR between the finetuned and original weight files, ensuring that only people who have access to the original weights (from completely legal sources, of course) can transform them into the finetuned weights.

You can find the decryption code at https://github.com/LianjiaTech/BELLE/tree/main/models.
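
For intuition, here is a minimal sketch of what such an XOR patch looks like. The function name and the key-repetition detail are illustrative assumptions of this sketch; the authoritative logic is in decrypt.py in the repository linked above.

```python
# Illustrative sketch only -- use the official decrypt.py for real conversions.
# XORs an encrypted patch file against the corresponding original LLaMA file,
# byte by byte, recovering the finetuned file.
def xor_patch(enc_path: str, original_path: str, out_path: str) -> None:
    with open(enc_path, "rb") as f:
        enc = f.read()
    with open(original_path, "rb") as f:
        key = f.read()
    # Repeat the key stream if the original file is shorter than the patch
    # (an assumption of this sketch, not necessarily the official scheme).
    recovered = bytes(b ^ key[i % len(key)] for i, b in enumerate(enc))
    with open(out_path, "wb") as f:
        f.write(recovered)
```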
|
|
|
|
|
# Model Card for BELLE-LLAMA-7B-2M-enc
|
|
|
## Welcome |
|
If you find this model helpful, please *like* it and star our GitHub repository: https://github.com/LianjiaTech/BELLE!
|
|
|
## Update |
|
A new checkpoint trained with a learning rate of 5e-6 has been uploaded.

In our evaluation, LLaMA finetuned with the smaller learning rate achieved better performance.
|
|
|
## Model description |
|
BELLE-LLAMA-7B-2M-enc is based on LLaMA 7B and finetuned on 2M pieces of Chinese data combined with 50,000 pieces of English data from the open-source Stanford Alpaca dataset, giving it good Chinese instruction understanding and response generation capabilities.
|
|
|
The code for generating the Chinese data and other details can be found in our GitHub repository: https://github.com/LianjiaTech/BELLE.
|
|
|
|
|
## Training hyper-parameters |
|
| Parameter | Value |
| ------ | ------ |
| Batch size | 16 |
| Learning rate | 5e-6 |
| Epochs | 3 |
| Weight decay | 0.0 |
| Warmup rate | 0.03 |
| LR scheduler | cosine |
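
As a rough illustration, these settings map onto Hugging Face TrainingArguments as sketched below. This is our hypothetical mapping, not the project's actual training script; see the BELLE GitHub repository for the real setup.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the hyper-parameter table onto TrainingArguments.
training_args = TrainingArguments(
    output_dir="./belle-llama-7b-2m",   # illustrative path
    per_device_train_batch_size=16,     # Batch size
    learning_rate=5e-6,                 # Learning rate
    num_train_epochs=3,                 # Epochs
    weight_decay=0.0,                   # Weight decay
    warmup_ratio=0.03,                  # Warmup rate
    lr_scheduler_type="cosine",         # LR scheduler
)
```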
|
|
|
## Download, Convert & Check |
|
1. After you git clone this model, check the MD5 checksums of the encrypted files:
|
```
md5sum ./*
45afa71e3067de5119233a57ef9d093d ./config.json.99a4ef2a26cb38c7f684cb83ed9343f660c561dd5a02a97d1b34b47419324dc5.enc
f9b33d359f17a437f6c24b4de6f2272e ./generation_config.json.fd7ff399e5568cc21a0a8414f43df88ef7c424995b9b97a90563165d2cf79efd.enc
172013287b452114abf5c0e64936f45b ./pytorch_model-00001-of-00002.bin.166879223b7504f1632d72b1577d57bceaa8fdeee1857c61119e575c50a4aae5.enc
384f8dc3b6da063c5f7554c52c531c44 ./pytorch_model-00002-of-00002.bin.2319db050dc286cb22c6e08a51a4ec0d9377017a7182a20a12c39eb658f39c80.enc
2ac1e5262eefd012918724d68813d03e ./pytorch_model.bin.index.json.f56e69fedde5d28e4f37f2b62f74e8522bbfa13395a6d696d1ef99222a431ab7.enc
c066b68b4139328e87a694020fc3a6c3 ./special_tokens_map.json.ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356.enc
2d5d4156fd237fceae85f28d06751020 ./tokenizer_config.json.a672113277a674d753b5cdcfa6bfc860dc69bfcc5511bdccb0c6af3ed08873a0.enc
39ec1b33fbf9a0934a8ae0f9a24c7163 ./tokenizer.model.9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347.enc
```
|
|
|
2. Decrypt the files using the scripts at https://github.com/LianjiaTech/BELLE/tree/main/models.
|
|
|
You can use the following Bash command. Replace "/path/to_encrypted" with the path where you stored the encrypted files, "/path/to_original_llama_7B" with the path to the original LLaMA 7B weights, and "/path/to_finetuned_model" with the path where you want to save the decrypted finetuned model.
|
|
|
```bash
mkdir /path/to_finetuned_model
for f in "/path/to_encrypted"/*; do
    if [ -f "$f" ]; then
        python3 decrypt.py "$f" "/path/to_original_llama_7B/consolidated.00.pth" "/path/to_finetuned_model/"
    fi
done
```
|
|
|
After running the command above, you will obtain the following files:
|
|
|
```
./config.json
./generation_config.json
./pytorch_model-00001-of-00002.bin
./pytorch_model-00002-of-00002.bin
./pytorch_model.bin.index.json
./special_tokens_map.json
./tokenizer_config.json
./tokenizer.model
```
|
|
|
3. Check md5sum |
|
|
|
You can verify the integrity of the decrypted files by comparing their MD5 checksums against the values below, ensuring the files were recovered completely:
|
```
md5sum ./*
a57bf2d0d7ec2590740bc4175262610b ./config.json
2917a1cafb895cf57e746cfd7696bfe5 ./generation_config.json
252143e5ed0f0073dc5c04159a0f78c2 ./pytorch_model-00001-of-00002.bin
3f71478bd783685f0a45fc742af85042 ./pytorch_model-00002-of-00002.bin
d5230ae5fb3bfd12df98af123be53cf5 ./pytorch_model.bin.index.json
8a80554c91d9fca8acb82f023de02f11 ./special_tokens_map.json
414f52220807d1300ad700283141de69 ./tokenizer_config.json
eeec4125e9c7560836b4873b6f8e3025 ./tokenizer.model
```
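
If md5sum is not available (e.g. on Windows), the same check can be done in Python. A small sketch, assuming it is run from the directory containing the decrypted files:

```python
import hashlib

# Expected MD5 checksums of the decrypted files, copied from the list above.
EXPECTED = {
    "config.json": "a57bf2d0d7ec2590740bc4175262610b",
    "generation_config.json": "2917a1cafb895cf57e746cfd7696bfe5",
    "pytorch_model-00001-of-00002.bin": "252143e5ed0f0073dc5c04159a0f78c2",
    "pytorch_model-00002-of-00002.bin": "3f71478bd783685f0a45fc742af85042",
    "pytorch_model.bin.index.json": "d5230ae5fb3bfd12df98af123be53cf5",
    "special_tokens_map.json": "8a80554c91d9fca8acb82f023de02f11",
    "tokenizer_config.json": "414f52220807d1300ad700283141de69",
    "tokenizer.model": "eeec4125e9c7560836b4873b6f8e3025",
}

for name, expected in EXPECTED.items():
    md5 = hashlib.md5()
    with open(name, "rb") as f:
        # Read in chunks so the multi-GB .bin shards need not fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    print(name, "OK" if md5.hexdigest() == expected else "MISMATCH")
```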
|
|
|
## Use model |
|
Please note that the input should be formatted as follows in both **training** and **inference**. |
|
```
Human: {input} \n\nAssistant:
```
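
For example, a tiny helper that applies this template to raw user input (the helper name is ours, purely illustrative):

```python
def build_prompt(user_input: str) -> str:
    # Wrap the raw input in the Human/Assistant template shown above.
    return f"Human: {user_input} \n\nAssistant: "
```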
|
|
|
In order to load BELLE-LLAMA-7B-2M-enc with Hugging Face transformers, please install transformers from the main branch, as the latest stable release did not yet support LLaMA (as of March 26, 2023).
|
```bash
pip install git+https://github.com/huggingface/transformers
```
|
|
|
After you decrypt the files, BELLE-LLAMA-7B-2M can be easily loaded with LlamaForCausalLM. |
|
```python
from transformers import LlamaForCausalLM, AutoTokenizer
import torch

ckpt = '/path/to_finetuned_model/'
device = torch.device('cuda')
# device_map='auto' spreads the weights across the available GPUs;
# low_cpu_mem_usage avoids materializing a full extra copy in CPU RAM.
model = LlamaForCausalLM.from_pretrained(ckpt, device_map='auto', low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(ckpt)
prompt = "Human: 写一首中文歌曲,赞美大自然 \n\nAssistant: "  # "Write a Chinese song praising nature"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generate_ids = model.generate(input_ids, max_new_tokens=500, do_sample=True, top_k=30, top_p=0.85, temperature=0.5, repetition_penalty=1.0, eos_token_id=2, bos_token_id=1, pad_token_id=0)
output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
# Strip the echoed prompt to keep only the model's reply.
response = output[len(prompt):]
```
|
|
|
## Limitations |
|
A few issues remain in the model trained on the current base model and data:
|
|
|
1. The model may produce factual errors when asked to follow instructions involving factual knowledge.

2. The model occasionally generates harmful responses, since it still struggles to identify potentially harmful instructions.

3. The model's reasoning and coding abilities still need improvement.
|
|
|
Since the model still has these limitations, we require that developers use the open-sourced code, data, model, and any other artifacts generated by this project for research purposes only. Commercial use and other potentially harmful use cases are not allowed.
|
|
|
|
|
## Citation |
|
|
|
Please cite us when using our code, data or model. |
|
|
|
```
@misc{BELLE,
  author = {Yunjie Ji and Yong Deng and Yan Gong and Yiping Peng and Qiang Niu and Baochang Ma and Xiangang Li},
  title = {BELLE: Be Everyone's Large Language model Engine},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/LianjiaTech/BELLE}},
}
```