wangxh07 committed on
Commit
c30ffa3
1 Parent(s): 3e83f2b

Update README.md

Browse files
Files changed (1)
  1. README.md +63 -37
README.md CHANGED
@@ -10,27 +10,30 @@ tags:
10
 
11
 
12
  <p align="center" width="100%">
13
- <a href="" target="_blank"><img src="https://github.com/zjunlp/CaMA/blob/main/assets/logo.jpg?raw=true" alt="ZJU-CaMA" style="width: 30%; min-width: 30px; display: block; margin: auto;"></a>
14
  </p>
15
 
16
 
17
- > This is the result of the weight difference between `Llama 13B` and `CaMA-13B`. You can click [here](https://github.com/zjunlp/cama) to learn more.
18
 
19
 
20
- # CaMA: A Chinese-English Bilingual LLaMA Model
21
 
22
- With the birth of ChatGPT, artificial intelligence has also entered the "iPhone moment," where various large language models (LLMs) have sprung up like mushrooms. The wave of these large models has quickly swept through artificial intelligence fields beyond natural language processing. However, training such a model requires extremely high hardware costs, and open-source language models are scarce due to various reasons, making Chinese language models even more scarce. It wasn't until the open-sourcing of LLaMA that a variety of language models based on LLaMA started to emerge. This project is also based on the LLaMA model. To further enhance Chinese language capabilities without compromising its original language distribution, we first <b>(1) perform additional pre-training on LLaMA (13B) using Chinese corpora, aiming to improve the model's Chinese comprehension and knowledge base while preserving its original English and code abilities to the greatest extent possible;</b> then, <b>(2) we fine-tune the model from the first step using an instruction dataset to enhance the language model's understanding of human instructions.</b>
 
 
 
23
 
24
  **The features of this project are as follows:**
25
 
26
- - We conducted full pre-training on LLaMA using the Chinese pre-training corpus we built, which improved the model's understanding of Chinese.
27
- - We utilized our Chinese instruction dataset, consisting of approximately 1.4 million samples, and performed LoRA fine-tuning to enhance the model's comprehension of human instructions.
28
- - We optimized the Information Extraction (IE) tasks, including Named Entity Recognition (NER), Relation Extraction (RE), and Event Extraction (EE), by utilizing human instructions to accomplish information extraction tasks.
29
- - We have open-sourced the weights of the pre-trained model and the LoRA weights used for instruction fine-tuning.
30
- - We have also made the full pre-training script available, which includes transformations, construction, and loading of large-scale corpora, as well as the LoRA instruction fine-tuning script.
31
 
32
 
33
- All weights have been uploaded to Hugging Face. The CaMA differential weights can be found [here](https://huggingface.co/zjunlp/CaMA-13B-Diff), and the LoRA weights can be found [here](https://huggingface.co/zjunlp/CaMA-13B-LoRA).
34
 
35
  ## Contents
36
 
@@ -206,10 +209,14 @@ Our pre-trained model has demonstrated certain abilities in instruction followin
206
  The effectiveness of information extraction is illustrated in the following figure. We tested different instructions for different tasks as well as the same instructions for the same task, and achieved good results for all of them.
207
 
208
  <p align="center" width="100%">
209
- <a href="" target="_blank"><img src="https://github.com/zjunlp/CaMA/blob/main/assets/ie-case.jpg?raw=true" alt="IE" style="width: 60%; min-width: 60px; display: block; margin: auto;"></a>
210
  </p>
211
 
 
212
 
 
 
 
213
 
214
  <h3 id="1-3">1.3 General Ablities Cases</h3>
215
 
@@ -363,8 +370,8 @@ The effectiveness of information extraction is illustrated in the following figu
363
  <h3 id="2-1">2.1 Environment Configuration</h3>
364
 
365
  ```shell
366
- conda create -n cama python=3.9 -y
367
- conda activate cama
368
  pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu116
369
  pip install -r requirements.txt
370
  ```
@@ -372,9 +379,9 @@ pip install -r requirements.txt
372
 
373
  <h3 id="2-2">2.2 Pretraining model weight acquisition and restoration</h3>
374
 
375
- > Since the Meta has not fully released the weights of LLaMA, we have computed the difference between the CaMA weights and the LLaMA weights and uploaded them [here](https://huggingface.co/zjunlp/CaMA-13B-Diff). To restore the complete CaMA weights, please follow the steps outlined below.
376
 
377
- **1. Download LLaMA 13B and CaMA-13B-Diff**
378
 
379
  Please click [here](https://forms.gle/jk851eBVbX1m5TAv5) to apply for the official pre-training weights of LLaMA from `meta`. In this case, we are using the `13B` version of the model, so you only need to download the `13B` version. Once downloaded, the file directory will be as follows:
380
 
@@ -389,10 +396,16 @@ Please click [here](https://forms.gle/jk851eBVbX1m5TAv5) to apply for the offici
389
  |-- tokenizer_checklist.chk
390
  ```
391
 
392
- You can use the following command to download the `CaMA-diff` file (assuming it is saved in the `./CaMA-Diff` folder):
 
 
 
 
 
393
  ```shell
394
- python tools/download.py --download_path ./CaMA-Diff --only_base
395
  ```
 
396
 > :exclamation: Note: If the download is interrupted, simply rerun the command above; Hugging Face supports resumable downloads and will continue from where it left off.
397
 
398
 **2. Use the conversion script provided by Hugging Face**
@@ -403,17 +416,23 @@ To convert the original LLaMA-13B model into the HuggingFace format, you can use
403
  python convert_llama_weights_to_hf.py --input_dir ./ --model_size 13B --output_dir ./converted
404
  ```
405
 
406
- **3. Restore CaMA 13B**
 
 
407
 
408
- Use the script we provided, located at `./tools/weight_diff.py`, execute the following command, and you will get the complete `CaMA` weight:
 
 
409
 
 
 
 
410
  ```shell
411
- python tools/weight_diff.py recover --path_raw ./converted --path_diff ./CaMA-Diff --path_tuned ./CaMA
412
  ```
413
 
414
- The final complete CaMA weights are saved in the `./CaMA` folder.
415
 
416
-
417
 
418
  <h3 id="2-3">2.3 Instruction tuning LoRA weight acquisition</h3>
419
 
@@ -431,26 +450,28 @@ The final complete weights are saved in the `./LoRA` folder.
431
 
432
  **1. Reproduce the results in Section 1**
433
 
434
- 1. If you want to reproduce the results in section `1.1`(**pretraining cases**), please run the following command (assuming that the complete pre-training weights of `CaMA` have been obtained according to the steps in section `2.2`, and the CaMA weight is saved in the `./CaMA` folder):
 
 
435
 
436
  ```shell
437
- python examples/generate_finetune.py --base_model ./CaMA
438
  ```
439
 
440
  The result in section `1.1` can be obtained.
441
 
442
- 2. If you want to reproduce the results in section `1.2`(**information extraction cases**), please run the following command (assuming that the LoRA weights of `CaMA` have been obtained according to the steps in section `2.3`, and the LoRA weights is saved in the `./LoRA` folder):
443
 
444
  ```shell
445
- python examples/generate_lora.py --load_8bit --base_model ./CaMA --lora_weights ./LoRA --run_ie_cases
446
  ```
447
 
448
  The result in section `1.2` can be obtained.
449
 
450
- 3. If you want to reproduce the results in section `1.3`(**general ablities cases**), please run the following command (assuming that the LoRA weights of `CaMA` have been obtained according to the steps in section `2.3`, and the LoRA weights is saved in the `./LoRA` folder):
451
 
452
  ```shell
453
- python examples/generate_lora.py --load_8bit --base_model ./CaMA --lora_weights ./LoRA --run_general_cases
454
  ```
455
 
456
  The result in section `1.3` can be obtained.
@@ -464,7 +485,7 @@ We offer two methods: the first one is **command-line interaction**, and the sec
464
  1. Use the following command to enter **command-line interaction**:
465
 
466
  ```shell
467
- python examples/generate_finetune.py --base_model ./CaMA --interactive
468
  ```
469
 
470
  The disadvantage is the inability to dynamically change decoding parameters.
@@ -472,24 +493,25 @@ We offer two methods: the first one is **command-line interaction**, and the sec
472
  2. Use the following command to enter **web-based interaction**:
473
 
474
  ```shell
475
- python examples/generate_finetune_web.py --base_model ./CaMA
476
  ```
477
  Here is a screenshot of the web-based interaction:
478
  <p align="center" width="100%">
479
- <a href="" target="_blank"><img src="https://github.com/zjunlp/CaMA/blob/main/assets/finetune_web.jpg?raw=true" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a>
480
  </p>
481
 
 
482
  **3. Usage of Instruction tuning Model**
483
 
484
  Here, we provide a web-based interaction method. Use the following command to access the web:
485
 
486
  ```shell
487
- python examples/generate_lora_web.py --base_model ./CaMA --lora_weights ./LoRA
488
  ```
489
 
490
  Here is a screenshot of the web-based interaction:
491
  <p align="center" width="100%">
492
- <a href="" target="_blank"><img src="https://github.com/zjunlp/CaMA/blob/main/assets/lora_web.png?raw=true" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a>
493
  </p>
494
 
495
  The `instruction` is a required parameter, while `input` is an optional parameter. For general tasks (such as the examples provided in section `1.3`), you can directly enter the input in the `instruction` field. For information extraction tasks (as shown in the example in section `1.2`), please enter the instruction in the `instruction` field and the sentence to be extracted in the `input` field. We provide an information extraction prompt in section `2.5`.
@@ -500,8 +522,9 @@ If you want to perform batch testing, please modify the `examples/generate_lora.
500
 
501
  <h3 id="2-5">2.5 Information Extraction Prompt</h3>
502
 
503
- For information extraction tasks such as named entity recognition (NER), event extraction (EE), and relation extraction (RE), we provide some prompts for ease of use. You can refer to this [link](./examples/ie_prompt.py) for examples. Of course, you can also try using your own prompts.
504
 
 
505
 
506
 
507
  <h2 id="3">3. Training Details</h2>
@@ -512,7 +535,7 @@ For information extraction tasks such as named entity recognition (NER), event e
512
  >
513
  > (2) Instruction tuning stage using LoRA. This stage enables the model to understand human instructions and generate appropriate responses.
514
 
515
- ![](https://github.com/zjunlp/CaMA/blob/main/assets/main.jpg?raw=true)
516
 
517
  <h3 id="3-1">3.1 Dataset Construction (Pretraining)</h3>
518
 
@@ -522,7 +545,7 @@ For the crawled datasets mentioned above, we employed a heuristic approach to fi
522
 
523
  <h3 id="3-2">3.2 Training Process (Pretraining)</h3>
524
 
525
- Detailed data processing code, training code, complete training scripts, and detailed training results can be found in [./pretrain](./pretrain).
526
 
527
  Before training, we need to tokenize the data. We set the maximum length of a single sample to `1024`, while most documents are much longer than this. Therefore, we need to partition these documents. **We designed a greedy algorithm to split the documents, with the goal of ensuring that each sample consists of complete sentences and minimizing the number of segments while maximizing the length of each sample.** Additionally, due to the diversity of data sources, we developed a comprehensive data preprocessing tool that can process and merge data from various sources. Finally, considering the large amount of data, loading it directly into memory would impose excessive hardware pressure. Therefore, we referred to [DeepSpeed-Megatron](https://github.com/bigscience-workshop/Megatron-DeepSpeed/tree/main/tools) and used the `mmap` method to process and load the data. This involves loading the indices into memory and accessing the corresponding data on disk when needed.
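The splitting step can be pictured with a small sketch (illustrative only; the sentence segmentation and token counting below are stand-ins, and the actual implementation lives in `./pretrain`): whole sentences are appended to the current sample until the next sentence would exceed the 1024-token budget, at which point a new sample is started.

```python
# Illustrative greedy packer (not the project's actual code): keep sentences
# intact and open a new sample only when the next sentence would overflow
# the token budget. Token counting is a whitespace stand-in for a tokenizer.
def split_document(sentences, max_len=1024, count_tokens=lambda s: len(s.split())):
    samples, current, current_len = [], [], 0
    for sent in sentences:
        n = count_tokens(sent)
        if current and current_len + n > max_len:
            samples.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        samples.append(" ".join(current))
    return samples  # a single over-long sentence still becomes its own sample
```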
528
 
@@ -554,7 +577,10 @@ In addition, we manually constructed a general Chinese dataset and translated it
554
  | Information Extraction Datasets (English) | 537429 |
555
  | Information Extraction Datasets (Chinese) | 486768 |
556
 
557
-
 
 
 
558
 
559
  <h3 id="3-4">3.4 Training Process (Instruction tuning)</h3>
560
 
 
10
 
11
 
12
  <p align="center" width="100%">
13
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/zx.png?raw=true" alt="ZJU-KnowLM" style="width: 30%; min-width: 30px; display: block; margin: auto;"></a>
14
  </p>
15
 
16
 
17
+ > These weights are the difference between `Llama 13B` and `ZhiXi-13B`. You can click [here](https://github.com/zjunlp/KnowLM) to learn more.
18
 
19
 
20
+ # Knowledgeable Large Language Model Framework
21
 
22
+ With the rapid development of deep learning, large language models such as ChatGPT have made substantial strides in natural language processing. However, these expansive models still face several challenges in acquiring and comprehending knowledge, including the difficulty of updating knowledge and potential knowledge discrepancies and biases, collectively known as knowledge fallacies. The KnowLM project aims to tackle these issues by launching an open-source, large-scale, knowledgeable language model framework and releasing the corresponding models.
23
+
24
+ The project's `initial phase` introduced a knowledge extraction LLM based on LLaMA, dubbed **ZhiXi (智析)**. To integrate Chinese understanding into the language model without compromising its inherent knowledge, we first <b>(1) perform full-scale pre-training of LLaMA (13B) on Chinese corpora, strengthening the model's understanding of Chinese and enriching its knowledge while retaining its original English and code capabilities;</b> then <b>(2) we fine-tune the model from the first step with an instruction dataset, bolstering its understanding of human instructions for knowledge extraction.</b>
25
+ - ❗Please note that this project is still undergoing optimization, and the model weights will be regularly updated to support new features and models!
26
 
27
  **The features of this project are as follows:**
28
 
29
+ - Centered on knowledge and large models, we conduct **full-scale pre-training** of large models such as LLaMA on our curated Chinese & English pre-training corpus.
30
+ - Based on **KG2Instructions** technology, knowledge extraction tasks, including NER, RE, and IE, are optimized and can be completed using human instructions.
31
+ - Using our Chinese instruction dataset (approximately 1,400K samples), we apply LoRA fine-tuning to enhance the model's understanding of human instructions.
32
+ - The weights of the pre-trained model and the LoRA instruction fine-tuning are open-sourced.
33
+ - The **full-scale pre-training code** (covering conversion, construction, and loading of large corpora) and the **LoRA instruction fine-tuning code** are open-sourced (with multi-machine, multi-GPU support).
34
 
35
 
36
+ All weights have been uploaded to Hugging Face. The ZhiXi differential weights can be found [here](https://huggingface.co/zjunlp/zhixi-13B-Diff), and the LoRA weights can be found [here](https://huggingface.co/zjunlp/zhixi-13B-LoRA).
37
 
38
  ## Contents
39
 
 
209
  The effectiveness of information extraction is illustrated in the following figure. We tested different instructions for different tasks as well as the same instructions for the same task, and achieved good results for all of them.
210
 
211
  <p align="center" width="100%">
212
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/ie-case-new_logo-en.png?raw=true" alt="IE" style="width: 60%; min-width: 60px; display: block; margin: auto;"></a>
213
  </p>
214
 
215
+ As shown in the figure, compared with other large models such as ChatGPT, our model achieves more accurate and comprehensive extraction results. However, we have also identified some extraction errors in ZhiXi. In the future, we will continue to enhance the model's semantic understanding in both Chinese and English and introduce more high-quality instruction data to improve its performance.
216
 
217
+ <p align="center" width="100%">
218
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/casevschatgpt.png?raw=true" alt="IE-cases-vs-chatgpt"></a>
219
+ </p>
220
 
221
  <h3 id="1-3">1.3 General Ablities Cases</h3>
222
 
 
370
  <h3 id="2-1">2.1 Environment Configuration</h3>
371
 
372
  ```shell
373
+ conda create -n zhixi python=3.9 -y
374
+ conda activate zhixi
375
  pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu116
376
  pip install -r requirements.txt
377
  ```
 
379
 
380
  <h3 id="2-2">2.2 Pretraining model weight acquisition and restoration</h3>
381
 
382
+ ❗❗❗ Note that in terms of hardware, step `2.2`, which merges LLaMA-13B with ZhiXi-13B-Diff, requires approximately **100GB** of RAM and no VRAM (this is due to the memory overhead of our merging strategy; we will improve the merging approach in future updates, and a 7B model is also under development, so stay tuned). Step `2.4`, inference with `ZhiXi`, requires at least **26GB** of VRAM.
383
 
384
+ **1. Download LLaMA 13B and ZhiXi-13B-Diff**
385
 
386
  Please click [here](https://forms.gle/jk851eBVbX1m5TAv5) to apply for the official pre-training weights of LLaMA from `meta`. In this case, we are using the `13B` version of the model, so you only need to download the `13B` version. Once downloaded, the file directory will be as follows:
387
 
 
396
  |-- tokenizer_checklist.chk
397
  ```
398
 
399
+ You can use the following command to download the `ZhiXi-13B-Diff` file (assuming it is saved in the `./zhixi-diff` folder):
400
+ ```shell
401
+ python tools/download.py --download_path ./zhixi-diff --only_base
402
+ ```
403
+
404
+ If you want to download the diff weights in the fp16 format, please use the following command (assuming it is saved in the `./zhixi-diff-fp16` folder):
405
  ```shell
406
+ python tools/download.py --download_path ./zhixi-diff-fp16 --only_base --fp16
407
  ```
408
+
409
  > :exclamation:Noted. If the download is interrupted, please repeat the command mentioned above. HuggingFace provides the functionality of resumable downloads, allowing you to resume the download from where it was interrupted.
410
 
411
 **2. Use the conversion script provided by Hugging Face**
 
416
  python convert_llama_weights_to_hf.py --input_dir ./ --model_size 13B --output_dir ./converted
417
  ```
418
 
419
+ **3. Restore ZhiXi 13B**
420
+
421
+ Use the script we provide at `./tools/weight_diff.py` and execute the following command to obtain the complete `ZhiXi` weights:
422
 
423
+ ```shell
424
+ python tools/weight_diff.py recover --path_raw ./converted --path_diff ./zhixi-diff --path_tuned ./zhixi
425
+ ```
426
 
427
+ The final complete ZhiXi weights are saved in the `./zhixi` folder.
428
+
429
+ If you downloaded the fp16 version of the diff weights, you can merge them with the following command. Please note that the result may differ slightly from the weights obtained from the fp32 version:
430
  ```shell
431
+ python tools/weight_diff.py recover --path_raw ./converted --path_diff ./zhixi-diff-fp16 --path_tuned ./zhixi
432
  ```
433
 
434
+ > ❗NOTE. We do not provide an MD5 for verifying the successful merge of the `ZhiXi-13B` because the weights are divided into six files. We employ the same validation strategy as [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca), which involves performing a sum check on the weights (you can refer to this [link](https://github.com/zjunlp/KnowLLM/blob/main/tools/weight_diff.py#L108)). **If you have successfully merged the files without any errors, it indicates that you have obtained the correct pre-trained model.**
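As a rough illustration of this diff-and-recover scheme (a minimal sketch, not the actual `tools/weight_diff.py`; the paths reuse those above and the expected checksum is a placeholder), the diff tensors store `tuned - raw`, so adding the raw LLaMA weights back reconstructs the model, and a global sum over all parameters serves as a cheap integrity check:

```python
# Minimal sketch of recover-and-sum-check (not the actual tools/weight_diff.py).
import torch
from transformers import AutoModelForCausalLM

raw = AutoModelForCausalLM.from_pretrained("./converted", torch_dtype=torch.float32)
diff = AutoModelForCausalLM.from_pretrained("./zhixi-diff", torch_dtype=torch.float32)

with torch.no_grad():
    for p_diff, p_raw in zip(diff.parameters(), raw.parameters()):
        p_diff.add_(p_raw)  # diff (= tuned - raw) + raw -> recovered weights

    checksum = sum(p.sum().item() for p in diff.parameters())
    # The real script compares this against a stored reference value;
    # EXPECTED_SUM below is a placeholder, so the check is left commented out.
    # assert abs(checksum - EXPECTED_SUM) < 1e-3

diff.save_pretrained("./zhixi")
```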
435
 
 
436
 
437
  <h3 id="2-3">2.3 Instruction tuning LoRA weight acquisition</h3>
438
 
 
450
 
451
  **1. Reproduce the results in Section 1**
452
 
453
+ > The cases in `Section 1` were all run on a V100 GPU. If you run them on other devices, the results may vary; please run them multiple times or adjust the decoding parameters.
454
+
455
+ 1. If you want to reproduce the results in section `1.1` (**pretraining cases**), please run the following command (assuming that the complete pre-training weights of `ZhiXi` have been obtained according to the steps in section `2.2`, and the ZhiXi weights are saved in the `./zhixi` folder):
456
 
457
  ```shell
458
+ python examples/generate_finetune.py --base_model ./zhixi
459
  ```
460
 
461
  The result in section `1.1` can be obtained.
462
 
463
+ 2. If you want to reproduce the results in section `1.2` (**information extraction cases**), please run the following command (assuming that the LoRA weights of `ZhiXi` have been obtained according to the steps in section `2.3`, and the LoRA weights are saved in the `./lora` folder):
464
 
465
  ```shell
466
+ python examples/generate_lora.py --load_8bit --base_model ./zhixi --lora_weights ./lora --run_ie_cases
467
  ```
468
 
469
  The result in section `1.2` can be obtained.
470
 
471
+ 3. If you want to reproduce the results in section `1.3` (**general abilities cases**), please run the following command (assuming that the LoRA weights of `ZhiXi` have been obtained according to the steps in section `2.3`, and the LoRA weights are saved in the `./lora` folder):
472
 
473
  ```shell
474
+ python examples/generate_lora.py --load_8bit --base_model ./zhixi --lora_weights ./lora --run_general_cases
475
  ```
476
 
477
  The result in section `1.3` can be obtained.
 
485
  1. Use the following command to enter **command-line interaction**:
486
 
487
  ```shell
488
+ python examples/generate_finetune.py --base_model ./zhixi --interactive
489
  ```
490
 
491
  The disadvantage is the inability to dynamically change decoding parameters.
 
493
  2. Use the following command to enter **web-based interaction**:
494
 
495
  ```shell
496
+ python examples/generate_finetune_web.py --base_model ./zhixi
497
  ```
498
  Here is a screenshot of the web-based interaction:
499
  <p align="center" width="100%">
500
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/finetune_web.jpg?raw=true" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a>
501
  </p>
502
 
503
+
504
  **3. Usage of Instruction tuning Model**
505
 
506
  Here, we provide a web-based interaction method. Use the following command to access the web:
507
 
508
  ```shell
509
+ python examples/generate_lora_web.py --base_model ./zhixi --lora_weights ./lora
510
  ```
511
 
512
  Here is a screenshot of the web-based interaction:
513
  <p align="center" width="100%">
514
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/lora_web.png?raw=true" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a>
515
  </p>
516
 
517
  The `instruction` is a required parameter, while `input` is an optional parameter. For general tasks (such as the examples provided in section `1.3`), you can directly enter the input in the `instruction` field. For information extraction tasks (as shown in the example in section `1.2`), please enter the instruction in the `instruction` field and the sentence to be extracted in the `input` field. We provide an information extraction prompt in section `2.5`.
 
522
 
523
  <h3 id="2-5">2.5 Information Extraction Prompt</h3>
524
 
525
+ For information extraction tasks such as named entity recognition (NER), event extraction (EE), and relation extraction (RE), we provide some prompts for ease of use. You can refer to this [link](https://github.com/zjunlp/KnowLM/blob/main/examples/ie_prompt.py) for examples. Of course, you can also try using your own prompts.
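The prompts follow the instruction/input split described in section `2.4`. A hypothetical NER-style prompt (the wording below is invented for illustration and is not copied from `ie_prompt.py`) might look like this:

```python
# Hypothetical NER-style prompt: the task description goes into `instruction`,
# the sentence to be extracted goes into `input`, mirroring section 2.4.
instruction = (
    "You are an expert in named entity recognition. Extract all entities of the "
    "types [Person, Organization, Location] from the input sentence and return "
    "them as (entity, type) pairs."
)
input_text = "Leo Messi joined Inter Miami, a club based in Florida."

prompt = f"Instruction: {instruction}\nInput: {input_text}\nAnswer:"
print(prompt)
```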
526
 
527
+ Here is a [case](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/README.md) where ZhiXi-13B-LoRA is used to accomplish the instruction-based knowledge graph construction task in CCKS2023.
528
 
529
 
530
  <h2 id="3">3. Training Details</h2>
 
535
  >
536
  > (2) Instruction tuning stage using LoRA. This stage enables the model to understand human instructions and generate appropriate responses.
537
 
538
+ ![](https://github.com/zjunlp/KnowLM/blob/main/assets/main_new.jpg?raw=true)
539
 
540
  <h3 id="3-1">3.1 Dataset Construction (Pretraining)</h3>
541
 
 
545
 
546
  <h3 id="3-2">3.2 Training Process (Pretraining)</h3>
547
 
548
+ Detailed data processing code, training code, complete training scripts, and detailed training results can be found in [./pretrain](https://github.com/zjunlp/KnowLM/blob/main/pretrain).
549
 
550
  Before training, we need to tokenize the data. We set the maximum length of a single sample to `1024`, while most documents are much longer than this. Therefore, we need to partition these documents. **We designed a greedy algorithm to split the documents, with the goal of ensuring that each sample consists of complete sentences and minimizing the number of segments while maximizing the length of each sample.** Additionally, due to the diversity of data sources, we developed a comprehensive data preprocessing tool that can process and merge data from various sources. Finally, considering the large amount of data, loading it directly into memory would impose excessive hardware pressure. Therefore, we referred to [DeepSpeed-Megatron](https://github.com/bigscience-workshop/Megatron-DeepSpeed/tree/main/tools) and used the `mmap` method to process and load the data. This involves loading the indices into memory and accessing the corresponding data on disk when needed.
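The mmap-based loading can be pictured with the following sketch (the on-disk layout here is invented for illustration and is not the project's actual binary format): only an `(offset, length)` index is held in memory, and the token file is memory-mapped so each sample is read from disk when accessed.

```python
# Simplified illustration of mmap-style lazy loading (invented file layout):
# keep an (offset, length) index in memory and memory-map the token file so
# samples are fetched from disk only on access.
import numpy as np

class MMapDataset:
    def __init__(self, bin_path, idx_path):
        self.index = np.load(idx_path)                        # shape (num_samples, 2)
        self.tokens = np.memmap(bin_path, dtype=np.uint16, mode="r")

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        offset, length = self.index[i]
        return np.asarray(self.tokens[offset:offset + length], dtype=np.int64)
```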
551
 
 
577
  | Information Extraction Datasets (English) | 537429 |
578
  | Information Extraction Datasets (Chinese) | 486768 |
579
 
580
+ **Flow diagram of KG2Instruction and other instruction fine-tuning datasets**
581
+ <p align="center" width="100%">
582
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/kg2instructions-en.png?raw=true"style="width: 90%; min-width: 90px; display: block; margin: auto;"></a>
583
+ </p>
584
 
585
  <h3 id="3-4">3.4 Training Process (Instruction tuning)</h3>
586