wangxh07 committed on
Commit
c30ffa3
1 Parent(s): 3e83f2b

Update README.md

Browse files
Files changed (1)
  1. README.md +63 -37
README.md CHANGED
@@ -10,27 +10,30 @@ tags:
10
 
11
 
12
  <p align="center" width="100%">
13
- <a href="" target="_blank"><img src="https://github.com/zjunlp/CaMA/blob/main/assets/logo.jpg?raw=true" alt="ZJU-CaMA" style="width: 30%; min-width: 30px; display: block; margin: auto;"></a>
14
  </p>
15
 
16
 
17
- > This is the result of the weight difference between `Llama 13B` and `CaMA-13B`. You can click [here](https://github.com/zjunlp/cama) to learn more.
18
 
19
 
20
- # CaMA: A Chinese-English Bilingual LLaMA Model
21
 
22
- With the birth of ChatGPT, artificial intelligence has also entered the "iPhone moment," where various large language models (LLMs) have sprung up like mushrooms. The wave of these large models has quickly swept through artificial intelligence fields beyond natural language processing. However, training such a model requires extremely high hardware costs, and open-source language models are scarce due to various reasons, making Chinese language models even more scarce. It wasn't until the open-sourcing of LLaMA that a variety of language models based on LLaMA started to emerge. This project is also based on the LLaMA model. To further enhance Chinese language capabilities without compromising its original language distribution, we first <b>(1) perform additional pre-training on LLaMA (13B) using Chinese corpora, aiming to improve the model's Chinese comprehension and knowledge base while preserving its original English and code abilities to the greatest extent possible;</b> then, <b>(2) we fine-tune the model from the first step using an instruction dataset to enhance the language model's understanding of human instructions.</b>
 
 
 
23
 
24
  **The features of this project are as follows:**
25
 
26
- - We conducted full pre-training on LLaMA using the Chinese pre-training corpus we built, which improved the model's understanding of Chinese.
27
- - We utilized our Chinese instruction dataset, consisting of approximately 1.4 million samples, and performed LoRA fine-tuning to enhance the model's comprehension of human instructions.
28
- - We optimized the Information Extraction (IE) tasks, including Named Entity Recognition (NER), Relation Extraction (RE), and Event Extraction (EE), by utilizing human instructions to accomplish information extraction tasks.
29
- - We have open-sourced the weights of the pre-trained model and the LoRA weights used for instruction fine-tuning.
30
- - We have also made the full pre-training script available, which includes transformations, construction, and loading of large-scale corpora, as well as the LoRA instruction fine-tuning script.
31
 
32
 
33
- All weights have been uploaded to Hugging Face. The CaMA differential weights can be found [here](https://huggingface.co/zjunlp/CaMA-13B-Diff), and the LoRA weights can be found [here](https://huggingface.co/zjunlp/CaMA-13B-LoRA).
34
 
35
  ## Contents
36
 
@@ -206,10 +209,14 @@ Our pre-trained model has demonstrated certain abilities in instruction followin
206
  The effectiveness of information extraction is illustrated in the following figure. We tested different instructions for different tasks as well as the same instructions for the same task, and achieved good results for all of them.
207
 
208
  <p align="center" width="100%">
209
- <a href="" target="_blank"><img src="https://github.com/zjunlp/CaMA/blob/main/assets/ie-case.jpg?raw=true" alt="IE" style="width: 60%; min-width: 60px; display: block; margin: auto;"></a>
210
  </p>
211
 
 
212
 
 
 
 
213
 
214
  <h3 id="1-3">1.3 General Ablities Cases</h3>
215
 
@@ -363,8 +370,8 @@ The effectiveness of information extraction is illustrated in the following figu
363
  <h3 id="2-1">2.1 Environment Configuration</h3>
364
 
365
  ```shell
366
- conda create -n cama python=3.9 -y
367
- conda activate cama
368
  pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu116
369
  pip install -r requirements.txt
370
  ```
@@ -372,9 +379,9 @@ pip install -r requirements.txt
372
 
373
  <h3 id="2-2">2.2 Pretraining model weight acquisition and restoration</h3>
374
 
375
- > Since the Meta has not fully released the weights of LLaMA, we have computed the difference between the CaMA weights and the LLaMA weights and uploaded them [here](https://huggingface.co/zjunlp/CaMA-13B-Diff). To restore the complete CaMA weights, please follow the steps outlined below.
376
 
377
- **1. Download LLaMA 13B and CaMA-13B-Diff**
378
 
379
  Please click [here](https://forms.gle/jk851eBVbX1m5TAv5) to apply for the official pre-training weights of LLaMA from `meta`. In this case, we are using the `13B` version of the model, so you only need to download the `13B` version. Once downloaded, the file directory will be as follows:
380
 
@@ -389,10 +396,16 @@ Please click [here](https://forms.gle/jk851eBVbX1m5TAv5) to apply for the offici
389
  |-- tokenizer_checklist.chk
390
  ```
391
 
392
- You can use the following command to download the `CaMA-diff` file (assuming it is saved in the `./CaMA-Diff` folder):
 
 
 
 
 
393
  ```shell
394
- python tools/download.py --download_path ./CaMA-Diff --only_base
395
  ```
 
396
 > :exclamation: Note: If the download is interrupted, simply rerun the command above; Hugging Face supports resumable downloads and will continue from where it left off.
397
 
398
 **2. Use the conversion script provided by Hugging Face**
@@ -403,17 +416,23 @@ To convert the original LLaMA-13B model into the HuggingFace format, you can use
403
  python convert_llama_weights_to_hf.py --input_dir ./ --model_size 13B --output_dir ./converted
404
  ```
405
 
406
- **3. Restore CaMA 13B**
 
 
407
 
408
- Use the script we provided, located at `./tools/weight_diff.py`, execute the following command, and you will get the complete `CaMA` weight:
 
 
409
 
 
 
 
410
  ```shell
411
- python tools/weight_diff.py recover --path_raw ./converted --path_diff ./CaMA-Diff --path_tuned ./CaMA
412
  ```
413
 
414
- The final complete CaMA weights are saved in the `./CaMA` folder.
415
 
416
-
417
 
418
  <h3 id="2-3">2.3 Instruction tuning LoRA weight acquisition</h3>
419
 
@@ -431,26 +450,28 @@ The final complete weights are saved in the `./LoRA` folder.
431
 
432
  **1. Reproduce the results in Section 1**
433
 
434
- 1. If you want to reproduce the results in section `1.1`(**pretraining cases**), please run the following command (assuming that the complete pre-training weights of `CaMA` have been obtained according to the steps in section `2.2`, and the CaMA weight is saved in the `./CaMA` folder):
 
 
435
 
436
  ```shell
437
- python examples/generate_finetune.py --base_model ./CaMA
438
  ```
439
 
440
  The result in section `1.1` can be obtained.
441
 
442
- 2. If you want to reproduce the results in section `1.2`(**information extraction cases**), please run the following command (assuming that the LoRA weights of `CaMA` have been obtained according to the steps in section `2.3`, and the LoRA weights is saved in the `./LoRA` folder):
443
 
444
  ```shell
445
- python examples/generate_lora.py --load_8bit --base_model ./CaMA --lora_weights ./LoRA --run_ie_cases
446
  ```
447
 
448
  The result in section `1.2` can be obtained.
449
 
450
- 3. If you want to reproduce the results in section `1.3`(**general ablities cases**), please run the following command (assuming that the LoRA weights of `CaMA` have been obtained according to the steps in section `2.3`, and the LoRA weights is saved in the `./LoRA` folder):
451
 
452
  ```shell
453
- python examples/generate_lora.py --load_8bit --base_model ./CaMA --lora_weights ./LoRA --run_general_cases
454
  ```
455
 
456
  The result in section `1.3` can be obtained.
@@ -464,7 +485,7 @@ We offer two methods: the first one is **command-line interaction**, and the sec
464
  1. Use the following command to enter **command-line interaction**:
465
 
466
  ```shell
467
- python examples/generate_finetune.py --base_model ./CaMA --interactive
468
  ```
469
 
470
  The disadvantage is the inability to dynamically change decoding parameters.
@@ -472,24 +493,25 @@ We offer two methods: the first one is **command-line interaction**, and the sec
472
  2. Use the following command to enter **web-based interaction**:
473
 
474
  ```shell
475
- python examples/generate_finetune_web.py --base_model ./CaMA
476
  ```
477
  Here is a screenshot of the web-based interaction:
478
  <p align="center" width="100%">
479
- <a href="" target="_blank"><img src="https://github.com/zjunlp/CaMA/blob/main/assets/finetune_web.jpg?raw=true" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a>
480
  </p>
481
 
 
482
  **3. Usage of Instruction tuning Model**
483
 
484
  Here, we provide a web-based interaction method. Use the following command to access the web:
485
 
486
  ```shell
487
- python examples/generate_lora_web.py --base_model ./CaMA --lora_weights ./LoRA
488
  ```
489
 
490
  Here is a screenshot of the web-based interaction:
491
  <p align="center" width="100%">
492
- <a href="" target="_blank"><img src="https://github.com/zjunlp/CaMA/blob/main/assets/lora_web.png?raw=true" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a>
493
  </p>
494
 
495
  The `instruction` is a required parameter, while `input` is an optional parameter. For general tasks (such as the examples provided in section `1.3`), you can directly enter the input in the `instruction` field. For information extraction tasks (as shown in the example in section `1.2`), please enter the instruction in the `instruction` field and the sentence to be extracted in the `input` field. We provide an information extraction prompt in section `2.5`.
@@ -500,8 +522,9 @@ If you want to perform batch testing, please modify the `examples/generate_lora.
500
 
501
  <h3 id="2-5">2.5 Information Extraction Prompt</h3>
502
 
503
- For information extraction tasks such as named entity recognition (NER), event extraction (EE), and relation extraction (RE), we provide some prompts for ease of use. You can refer to this [link](./examples/ie_prompt.py) for examples. Of course, you can also try using your own prompts.
504
 
 
505
 
506
 
507
  <h2 id="3">3. Training Details</h2>
@@ -512,7 +535,7 @@ For information extraction tasks such as named entity recognition (NER), event e
512
  >
513
  > (2) Instruction tuning stage using LoRA. This stage enables the model to understand human instructions and generate appropriate responses.
514
 
515
- ![](https://github.com/zjunlp/CaMA/blob/main/assets/main.jpg?raw=true)
516
 
517
  <h3 id="3-1">3.1 Dataset Construction (Pretraining)</h3>
518
 
@@ -522,7 +545,7 @@ For the crawled datasets mentioned above, we employed a heuristic approach to fi
522
 
523
  <h3 id="3-2">3.2 Training Process (Pretraining)</h3>
524
 
525
- Detailed data processing code, training code, complete training scripts, and detailed training results can be found in [./pretrain](./pretrain).
526
 
527
  Before training, we need to tokenize the data. We set the maximum length of a single sample to `1024`, while most documents are much longer than this. Therefore, we need to partition these documents. **We designed a greedy algorithm to split the documents, with the goal of ensuring that each sample consists of complete sentences and minimizing the number of segments while maximizing the length of each sample.** Additionally, due to the diversity of data sources, we developed a comprehensive data preprocessing tool that can process and merge data from various sources. Finally, considering the large amount of data, loading it directly into memory would impose excessive hardware pressure. Therefore, we referred to [DeepSpeed-Megatron](https://github.com/bigscience-workshop/Megatron-DeepSpeed/tree/main/tools) and used the `mmap` method to process and load the data. This involves loading the indices into memory and accessing the corresponding data on disk when needed.
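The splitting step can be pictured with a small sketch (illustrative only; the sentence segmentation and token counting below are stand-ins, and the actual implementation lives in `./pretrain`): whole sentences are appended to the current sample until the next sentence would exceed the 1024-token budget, at which point a new sample is started.

```python
# Illustrative greedy packer (not the project's actual code): keep sentences
# intact and open a new sample only when the next sentence would overflow
# the token budget. Token counting is a whitespace stand-in for a tokenizer.
def split_document(sentences, max_len=1024, count_tokens=lambda s: len(s.split())):
    samples, current, current_len = [], [], 0
    for sent in sentences:
        n = count_tokens(sent)
        if current and current_len + n > max_len:
            samples.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        samples.append(" ".join(current))
    return samples  # a single over-long sentence still becomes its own sample
```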
528
 
@@ -554,7 +577,10 @@ In addition, we manually constructed a general Chinese dataset and translated it
554
  | Information Extraction Datasets (English) | 537429 |
555
  | Information Extraction Datasets (Chinese) | 486768 |
556
 
557
-
 
 
 
558
 
559
  <h3 id="3-4">3.4 Training Process (Instruction tuning)</h3>
560
 
 
10
 
11
 
12
  <p align="center" width="100%">
13
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/zx.png?raw=true" alt="ZJU-KnowLM" style="width: 30%; min-width: 30px; display: block; margin: auto;"></a>
14
  </p>
15
 
16
 
17
+ > These weights are the difference between `Llama 13B` and `ZhiXi-13B`. You can click [here](https://github.com/zjunlp/KnowLM) to learn more.
18
 
19
 
20
+ # Knowledgeable Large Language Model Framework
21
 
22
+ With the rapid development of deep learning, large language models such as ChatGPT have made substantial strides in natural language processing. However, these expansive models still face several challenges in acquiring and comprehending knowledge, including the difficulty of updating knowledge and potential knowledge discrepancies and biases, collectively known as knowledge fallacies. The KnowLM project aims to tackle these issues by launching an open-source, large-scale, knowledgeable language model framework and releasing the corresponding models.
23
+
24
+ The project's `initial phase` introduced a knowledge extraction LLM based on LLaMA, dubbed **ZhiXi (智析)**. To integrate Chinese understanding into the language model without compromising its inherent knowledge, we first <b>(1) perform full-scale pre-training of LLaMA (13B) on Chinese corpora, strengthening the model's understanding of Chinese and enriching its knowledge while retaining its original English and code capabilities;</b> then <b>(2) we fine-tune the model from the first step with an instruction dataset, bolstering its understanding of human instructions for knowledge extraction.</b>
25
+ - ❗Please note that this project is still undergoing optimization, and the model weights will be regularly updated to support new features and models!
26
 
27
  **The features of this project are as follows:**
28
 
29
+ - Centered on knowledge and large models, we conduct **full-scale pre-training** of large models such as LLaMA on our curated Chinese & English pre-training corpus.
30
+ - Based on **KG2Instructions** technology, knowledge extraction tasks, including NER, RE, and IE, are optimized and can be completed using human instructions.
31
+ - Using our Chinese instruction dataset (approximately 1,400K samples), we apply LoRA fine-tuning to enhance the model's understanding of human instructions.
32
+ - The weights of the pre-trained model and the LoRA instruction fine-tuning are open-sourced.
33
+ - The **full-scale pre-training code** (covering conversion, construction, and loading of large corpora) and the **LoRA instruction fine-tuning code** are open-sourced (with multi-machine, multi-GPU support).
34
 
35
 
36
+ All weights have been uploaded to Hugging Face. The ZhiXi differential weights can be found [here](https://huggingface.co/zjunlp/zhixi-13B-Diff), and the LoRA weights can be found [here](https://huggingface.co/zjunlp/zhixi-13B-LoRA).
37
 
38
  ## Contents
39
 
 
209
  The effectiveness of information extraction is illustrated in the following figure. We tested different instructions for different tasks as well as the same instructions for the same task, and achieved good results for all of them.
210
 
211
  <p align="center" width="100%">
212
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/ie-case-new_logo-en.png?raw=true" alt="IE" style="width: 60%; min-width: 60px; display: block; margin: auto;"></a>
213
  </p>
214
 
215
+ As shown in the figure, compared with other large models such as ChatGPT, our model achieves more accurate and comprehensive extraction results. However, we have also identified some extraction errors in ZhiXi. In the future, we will continue to enhance the model's semantic understanding in both Chinese and English and introduce more high-quality instruction data to improve its performance.
216
 
217
+ <p align="center" width="100%">
218
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/casevschatgpt.png?raw=true" alt="IE-cases-vs-chatgpt"></a>
219
+ </p>
220
 
221
  <h3 id="1-3">1.3 General Ablities Cases</h3>
222
 
 
370
  <h3 id="2-1">2.1 Environment Configuration</h3>
371
 
372
  ```shell
373
+ conda create -n zhixi python=3.9 -y
374
+ conda activate zhixi
375
  pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu116
376
  pip install -r requirements.txt
377
  ```
 
379
 
380
  <h3 id="2-2">2.2 Pretraining model weight acquisition and restoration</h3>
381
 
382
+ ❗❗❗ Note that in terms of hardware, step `2.2`, which merges LLaMA-13B with ZhiXi-13B-Diff, requires approximately **100GB** of RAM and no VRAM (this is due to the memory overhead of our merging strategy; we will improve the merging approach in future updates, and a 7B model is also under development, so stay tuned). Step `2.4`, inference with `ZhiXi`, requires at least **26GB** of VRAM.
383
 
384
+ **1. Download LLaMA 13B and ZhiXi-13B-Diff**
385
 
386
  Please click [here](https://forms.gle/jk851eBVbX1m5TAv5) to apply for the official pre-training weights of LLaMA from `meta`. In this case, we are using the `13B` version of the model, so you only need to download the `13B` version. Once downloaded, the file directory will be as follows:
387
 
 
396
  |-- tokenizer_checklist.chk
397
  ```
398
 
399
+ You can use the following command to download the `ZhiXi-13B-Diff` file (assuming it is saved in the `./zhixi-diff` folder):
400
+ ```shell
401
+ python tools/download.py --download_path ./zhixi-diff --only_base
402
+ ```
403
+
404
+ If you want to download the diff weights in the fp16 format, please use the following command (assuming it is saved in the `./zhixi-diff-fp16` folder):
405
  ```shell
406
+ python tools/download.py --download_path ./zhixi-diff-fp16 --only_base --fp16
407
  ```
408
+
409
  > :exclamation:Noted. If the download is interrupted, please repeat the command mentioned above. HuggingFace provides the functionality of resumable downloads, allowing you to resume the download from where it was interrupted.
410
 
411
 **2. Use the conversion script provided by Hugging Face**
 
416
  python convert_llama_weights_to_hf.py --input_dir ./ --model_size 13B --output_dir ./converted
417
  ```
418
 
419
+ **3. Restore ZhiXi 13B**
420
+
421
+ Use the script we provide at `./tools/weight_diff.py` and execute the following command to obtain the complete `ZhiXi` weights:
422
 
423
+ ```shell
424
+ python tools/weight_diff.py recover --path_raw ./converted --path_diff ./zhixi-diff --path_tuned ./zhixi
425
+ ```
426
 
427
+ The final complete ZhiXi weights are saved in the `./zhixi` folder.
428
+
429
+ If you downloaded the fp16 version of the diff weights, you can merge them with the following command. Please note that the result may differ slightly from the weights obtained from the fp32 version:
430
  ```shell
431
+ python tools/weight_diff.py recover --path_raw ./converted --path_diff ./zhixi-diff-fp16 --path_tuned ./zhixi
432
  ```
433
 
434
+ > ❗NOTE. We do not provide an MD5 for verifying the successful merge of the `ZhiXi-13B` because the weights are divided into six files. We employ the same validation strategy as [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca), which involves performing a sum check on the weights (you can refer to this [link](https://github.com/zjunlp/KnowLLM/blob/main/tools/weight_diff.py#L108)). **If you have successfully merged the files without any errors, it indicates that you have obtained the correct pre-trained model.**
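As a rough illustration of this diff-and-recover scheme (a minimal sketch, not the actual `tools/weight_diff.py`; the paths reuse those above and the expected checksum is a placeholder), the diff tensors store `tuned - raw`, so adding the raw LLaMA weights back reconstructs the model, and a global sum over all parameters serves as a cheap integrity check:

```python
# Minimal sketch of recover-and-sum-check (not the actual tools/weight_diff.py).
import torch
from transformers import AutoModelForCausalLM

raw = AutoModelForCausalLM.from_pretrained("./converted", torch_dtype=torch.float32)
diff = AutoModelForCausalLM.from_pretrained("./zhixi-diff", torch_dtype=torch.float32)

with torch.no_grad():
    for p_diff, p_raw in zip(diff.parameters(), raw.parameters()):
        p_diff.add_(p_raw)  # diff (= tuned - raw) + raw -> recovered weights

    checksum = sum(p.sum().item() for p in diff.parameters())
    # The real script compares this against a stored reference value;
    # EXPECTED_SUM below is a placeholder, so the check is left commented out.
    # assert abs(checksum - EXPECTED_SUM) < 1e-3

diff.save_pretrained("./zhixi")
```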
435
 
 
436
 
437
  <h3 id="2-3">2.3 Instruction tuning LoRA weight acquisition</h3>
438
 
 
450
 
451
  **1. Reproduce the results in Section 1**
452
 
453
+ > The cases in `Section 1` were all run on a V100 GPU. If you run them on other devices, the results may vary; please run them multiple times or adjust the decoding parameters.
454
+
455
+ 1. If you want to reproduce the results in section `1.1` (**pretraining cases**), please run the following command (assuming that the complete pre-training weights of `ZhiXi` have been obtained according to the steps in section `2.2`, and the ZhiXi weights are saved in the `./zhixi` folder):
456
 
457
  ```shell
458
+ python examples/generate_finetune.py --base_model ./zhixi
459
  ```
460
 
461
  The result in section `1.1` can be obtained.
462
 
463
+ 2. If you want to reproduce the results in section `1.2` (**information extraction cases**), please run the following command (assuming that the LoRA weights of `ZhiXi` have been obtained according to the steps in section `2.3`, and the LoRA weights are saved in the `./lora` folder):
464
 
465
  ```shell
466
+ python examples/generate_lora.py --load_8bit --base_model ./zhixi --lora_weights ./lora --run_ie_cases
467
  ```
468
 
469
  The result in section `1.2` can be obtained.
470
 
471
+ 3. If you want to reproduce the results in section `1.3` (**general abilities cases**), please run the following command (assuming that the LoRA weights of `ZhiXi` have been obtained according to the steps in section `2.3`, and the LoRA weights are saved in the `./lora` folder):
472
 
473
  ```shell
474
+ python examples/generate_lora.py --load_8bit --base_model ./zhixi --lora_weights ./lora --run_general_cases
475
  ```
476
 
477
  The result in section `1.3` can be obtained.
 
485
  1. Use the following command to enter **command-line interaction**:
486
 
487
  ```shell
488
+ python examples/generate_finetune.py --base_model ./zhixi --interactive
489
  ```
490
 
491
  The disadvantage is the inability to dynamically change decoding parameters.
 
493
  2. Use the following command to enter **web-based interaction**:
494
 
495
  ```shell
496
+ python examples/generate_finetune_web.py --base_model ./zhixi
497
  ```
498
  Here is a screenshot of the web-based interaction:
499
  <p align="center" width="100%">
500
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/finetune_web.jpg?raw=true" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a>
501
  </p>
502
 
503
+
504
  **3. Usage of Instruction tuning Model**
505
 
506
  Here, we provide a web-based interaction method. Use the following command to access the web:
507
 
508
  ```shell
509
+ python examples/generate_lora_web.py --base_model ./zhixi --lora_weights ./lora
510
  ```
511
 
512
  Here is a screenshot of the web-based interaction:
513
  <p align="center" width="100%">
514
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/lora_web.png?raw=true" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a>
515
  </p>
516
 
517
  The `instruction` is a required parameter, while `input` is an optional parameter. For general tasks (such as the examples provided in section `1.3`), you can directly enter the input in the `instruction` field. For information extraction tasks (as shown in the example in section `1.2`), please enter the instruction in the `instruction` field and the sentence to be extracted in the `input` field. We provide an information extraction prompt in section `2.5`.
 
522
 
523
  <h3 id="2-5">2.5 Information Extraction Prompt</h3>
524
 
525
+ For information extraction tasks such as named entity recognition (NER), event extraction (EE), and relation extraction (RE), we provide some prompts for ease of use. You can refer to this [link](https://github.com/zjunlp/KnowLM/blob/main/examples/ie_prompt.py) for examples. Of course, you can also try using your own prompts.
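The prompts follow the instruction/input split described in section `2.4`. A hypothetical NER-style prompt (the wording below is invented for illustration and is not copied from `ie_prompt.py`) might look like this:

```python
# Hypothetical NER-style prompt: the task description goes into `instruction`,
# the sentence to be extracted goes into `input`, mirroring section 2.4.
instruction = (
    "You are an expert in named entity recognition. Extract all entities of the "
    "types [Person, Organization, Location] from the input sentence and return "
    "them as (entity, type) pairs."
)
input_text = "Leo Messi joined Inter Miami, a club based in Florida."

prompt = f"Instruction: {instruction}\nInput: {input_text}\nAnswer:"
print(prompt)
```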
526
 
527
+ Here is a [case](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/README.md) where ZhiXi-13B-LoRA is used to accomplish the instruction-based knowledge graph construction task in CCKS2023.
528
 
529
 
530
  <h2 id="3">3. Training Details</h2>
 
535
  >
536
  > (2) Instruction tuning stage using LoRA. This stage enables the model to understand human instructions and generate appropriate responses.
537
 
538
+ ![](https://github.com/zjunlp/KnowLM/blob/main/assets/main_new.jpg?raw=true)
539
 
540
  <h3 id="3-1">3.1 Dataset Construction (Pretraining)</h3>
541
 
 
545
 
546
  <h3 id="3-2">3.2 Training Process (Pretraining)</h3>
547
 
548
+ Detailed data processing code, training code, complete training scripts, and detailed training results can be found in [./pretrain](https://github.com/zjunlp/KnowLM/blob/main/pretrain).
549
 
550
  Before training, we need to tokenize the data. We set the maximum length of a single sample to `1024`, while most documents are much longer than this. Therefore, we need to partition these documents. **We designed a greedy algorithm to split the documents, with the goal of ensuring that each sample consists of complete sentences and minimizing the number of segments while maximizing the length of each sample.** Additionally, due to the diversity of data sources, we developed a comprehensive data preprocessing tool that can process and merge data from various sources. Finally, considering the large amount of data, loading it directly into memory would impose excessive hardware pressure. Therefore, we referred to [DeepSpeed-Megatron](https://github.com/bigscience-workshop/Megatron-DeepSpeed/tree/main/tools) and used the `mmap` method to process and load the data. This involves loading the indices into memory and accessing the corresponding data on disk when needed.
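The mmap-based loading can be pictured with the following sketch (the on-disk layout here is invented for illustration and is not the project's actual binary format): only an `(offset, length)` index is held in memory, and the token file is memory-mapped so each sample is read from disk when accessed.

```python
# Simplified illustration of mmap-style lazy loading (invented file layout):
# keep an (offset, length) index in memory and memory-map the token file so
# samples are fetched from disk only on access.
import numpy as np

class MMapDataset:
    def __init__(self, bin_path, idx_path):
        self.index = np.load(idx_path)                        # shape (num_samples, 2)
        self.tokens = np.memmap(bin_path, dtype=np.uint16, mode="r")

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        offset, length = self.index[i]
        return np.asarray(self.tokens[offset:offset + length], dtype=np.int64)
```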
551
 
 
577
  | Information Extraction Datasets (English) | 537429 |
578
  | Information Extraction Datasets (Chinese) | 486768 |
579
 
580
+ **Flow diagram of KG2Instruction and other instruction fine-tuning datasets**
581
+ <p align="center" width="100%">
582
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/kg2instructions-en.png?raw=true"style="width: 90%; min-width: 90px; display: block; margin: auto;"></a>
583
+ </p>
584
 
585
  <h3 id="3-4">3.4 Training Process (Instruction tuning)</h3>
586