Upload 3 files
- adapter_config.json +19 -0
- adapter_model.bin +3 -0
- readme.md +140 -0
adapter_config.json
ADDED
@@ -0,0 +1,19 @@
{
  "base_model_name_or_path": "/public/home/xlwang2/codes/zhr/ChatGLM2-6B/chatglm2-6b",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "lora_alpha": 32.0,
  "lora_dropout": 0.1,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 8,
  "target_modules": [
    "query_key_value",
    "dense",
    "dense_h_to_4h",
    "dense_4h_to_h"
  ],
  "task_type": "CAUSAL_LM"
}
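
For reference, the standard LoRA formulation scales the low-rank update by `lora_alpha / r`, so the values in this config give an effective scaling of 4.0. A minimal sketch that reads these fields from a config of this shape follows; the inline JSON string is a trimmed copy of the file used only for illustration, and `lora_scaling` is a hypothetical helper name.

```python
import json

# Trimmed copy of the adapter_config.json fields relevant to LoRA scaling.
ADAPTER_CONFIG = """
{
  "peft_type": "LORA",
  "r": 8,
  "lora_alpha": 32.0,
  "lora_dropout": 0.1,
  "target_modules": ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]
}
"""


def lora_scaling(config: dict) -> float:
    """Return the factor applied to the low-rank update (lora_alpha / r)."""
    return config["lora_alpha"] / config["r"]


config = json.loads(ADAPTER_CONFIG)
scaling = lora_scaling(config)
print(f"LoRA scaling: {scaling}")  # 32.0 / 8
```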
adapter_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a6c896d609aa28f94d66a44ef0e8a561b6be37fc184a6dc953cf7b62c0212460
size 59375373
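
The size recorded in this pointer can be sanity-checked against a rough LoRA parameter count. The sketch below assumes layer shapes taken from the public ChatGLM2-6B configuration (28 layers, hidden size 4096, fused QKV width 4608, FFN width 13696 with a doubled up-projection) and float32 storage. None of these values are stated in this commit, so treat them as assumptions. Each adapted linear layer of shape `in -> out` adds `r * (in + out)` parameters.

```python
# Assumed ChatGLM2-6B linear shapes (not part of this commit): (in_features, out_features)
ASSUMED_SHAPES = {
    "query_key_value": (4096, 4608),
    "dense": (4096, 4096),
    "dense_h_to_4h": (4096, 27392),
    "dense_4h_to_h": (13696, 4096),
}
NUM_LAYERS = 28       # assumed number of transformer layers
RANK = 8              # "r" in adapter_config.json
BYTES_PER_PARAM = 4   # assuming float32 storage


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds A (rank x in_features) and B (out_features x rank).
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in ASSUMED_SHAPES.values())
total_params = per_layer * NUM_LAYERS
approx_bytes = total_params * BYTES_PER_PARAM
print(total_params, approx_bytes)
```

Under these assumptions the estimate is about 59.3 MB, close to the 59,375,373 bytes recorded in the pointer; the difference would be serialization overhead.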
readme.md
ADDED
@@ -0,0 +1,140 @@
<div align=center>
<img src="https://github.com/RUC-GSAI/YuLan-IR/blob/main/yulan.jpg" width="400px">
<h1>RETA-LLM: A Retrieval-Augmented Large Language Model Toolkit</h1>
<a href="https://github.com/RUC-GSAI/YuLan-IR">
<img src="https://img.shields.io/badge/MIT-License-blue" alt="license">
<img src="https://img.shields.io/github/stars/RUC-GSAI/YuLan-IR" alt="stars">
</a>
</div>

**RETA-LLM** is a **RET**rieval-**A**ugmented LLM toolkit that supports research in retrieval-augmented generation and helps users build their own in-domain LLM-based systems. RETA-LLM provides five plug-and-play modules to support better interaction between IR systems and LLMs: **request rewriting, document retrieval, passage extraction, answer generation, and fact checking**. A complete pipeline is also provided so that researchers and users can build an in-domain LLM-based system from scratch on their own document repository. Our paper can be found at [paper](https://github.com/RUC-GSAI/YuLan-IR/blob/main/RETA-LLM/resource/paper.pdf).

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Background](#background)
- [Introduction](#introduction)
- [Requirements](#requirements)
- [Usage](#usage)
- [Case](#case)
- [To-Do](#to-do)
- [Maintainers](#maintainers)
- [Acknowledgements](#acknowledgements)
- [License](#license)

## Background

Large Language Models (LLMs) have shown extraordinary abilities in many areas. However, studies have shown that they still tend to hallucinate and sometimes generate responses that contradict the facts. To address these problems, researchers have proposed a new paradigm that strengthens LLMs with information retrieval systems (retrieval-augmented LLMs), enabling LLMs to look up relevant content in an external IR system. Furthermore, when supplied with in-domain data resources, retrieval-augmented LLMs can answer in-domain questions such as "Who is the dean of the Gaoling School of Artificial Intelligence of Renmin University of China?"

## Introduction

To support research in this area and help users build their own in-domain QA systems, we devised **RETA-LLM**, a **RET**rieval-**A**ugmented LLM toolkit. Compared with previous LLM toolkits such as LangChain, our RETA-LLM toolkit focuses on retrieval-augmented LLMs and provides more optional plug-in modules. We also disentangle the LLMs and the IR system more thoroughly, so you can customize both the search engine and the LLMs.

The overall framework of our toolkit is shown as follows: ![RETA-LLM Framework](./resource/framework.jpg)

In general, the RETA-LLM toolkit consists of five steps/modules.

- **Request Rewriting**: First, RETA-LLM uses LLMs to revise the user's current request based on their history, making it complete and clear.
- **Doc Retrieval**: Second, RETA-LLM uses the revised user request to retrieve relevant documents from a customized document corpus. In our demo, we use [disentangled-retriever](https://github.com/jingtaozhan/disentangled-retriever) as the retriever for HTML materials in Chinese. You can also customize your own searcher.
- **Passage Extraction**: Third, since the full content of the relevant documents may be too long for LLMs to generate responses from, RETA-LLM uses LLMs to extract relevant fragments/passages from the retrieved documents to form references for generation.
- **Answer Generation**: Fourth, RETA-LLM provides the revised user request and the references to the LLM to generate answers.
- **Fact Checking**: Finally, RETA-LLM applies LLMs to verify whether the generated answers contain factual mistakes and outputs the final response to the user request.
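
The five steps above can be sketched as one function that chains an LLM and a searcher. This is only an illustrative outline, not the toolkit's actual code: the prompt wording, the `answer_request` name, the "starts with no" verdict convention, and the stub model and searcher are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical interfaces: an LLM maps a prompt to text, a searcher maps a query to documents.
LLM = Callable[[str], str]
Searcher = Callable[[str], List[str]]

FALLBACK = "Sorry, I could not verify an answer."


def answer_request(request: str, history: List[str], llm: LLM, search: Searcher, top_k: int = 3) -> str:
    # 1. Request rewriting: make the request complete using the history.
    rewritten = llm(f"History: {history}\nRewrite the request so it is complete: {request}")
    # 2. Document retrieval: fetch candidate documents for the rewritten request.
    docs = search(rewritten)[:top_k]
    # 3. Passage extraction: keep only the relevant fragments of each document.
    passages = [llm(f"Extract passages relevant to '{rewritten}' from: {doc}") for doc in docs]
    references = "\n".join(passages)
    # 4. Answer generation: answer using the references.
    answer = llm(f"References:\n{references}\nAnswer the request: {rewritten}")
    # 5. Fact checking: ask the LLM whether the answer contradicts the references.
    verdict = llm(f"References:\n{references}\nDoes this answer contain factual mistakes? {answer}")
    return answer if verdict.strip().lower().startswith("no") else FALLBACK


def stub_llm(prompt: str) -> str:
    # Placeholder model: reports no mistakes when asked to check, otherwise echoes the last prompt line.
    if "factual mistakes" in prompt:
        return "No"
    return prompt.splitlines()[-1]


def stub_search(query: str) -> List[str]:
    # Placeholder searcher returning a fixed ranking.
    return ["doc A", "doc B"]


result = answer_request("Who is the dean?", [], stub_llm, stub_search)
print(result)
```

Because each step only depends on the two callables, a different LLM could be passed for each step by splitting the `llm` argument, which mirrors the toolkit's goal of keeping the LLMs and the IR system decoupled.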

## Requirements
The requirements of the RETA-LLM toolkit are listed in the `environment.yml` file. Install them with:

```
cd the-root-path-of-this-repository
conda env create -f environment.yml
conda activate retallm

pip install adapter-transformers --force-reinstall
pip install transformers==4.28.0 --force-reinstall
git clone https://github.com/adapter-hub/adapter-transformers.git
mv adapter-transformers adaptertransformers

# The four lines above fix the conflicts between adapter-transformers and transformers. Please do not change their order.

```


## Usage

We provide a complete pipeline to help you use your own customized materials (e.g., HTML files crawled from websites) to build your own RETA-LLM toolkit. The pipeline is as follows:

0. Prepare your HTML file resources in the `raw_data` folder, together with the mapping table file `url.txt`, which maps website URLs to file names. `url.txt` should be in TSV format, with each line formatted as:

`file name(without ".html") \t url`

We provide example data and an example URL file in `sample_data.zip` and `sample_url.txt`.
By following the usage guidelines, you can build a RUC enrollment assistant with them.
```
unzip sample_data.zip
mv sample_data raw_data
mv sample_url.txt url.txt
```
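
Before running the later steps, it can help to check that `url.txt` follows the TSV format described above. The following is a minimal sketch; `read_url_table` is a hypothetical helper, not part of the toolkit, and the sample rows use made-up file names and URLs.

```python
import io


def read_url_table(lines):
    """Parse `file name<TAB>url` rows into a dict, validating each row."""
    mapping = {}
    for row_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {row_number}: expected 2 tab-separated fields, got {len(fields)}")
        name, url = fields[0].strip(), fields[1].strip()
        if name.endswith(".html"):
            raise ValueError(f"line {row_number}: file name should omit the .html suffix")
        mapping[name] = url
    return mapping


# Made-up sample rows; a real file would be opened with open("url.txt", encoding="utf-8").
sample = io.StringIO("page_001\thttps://example.com/a.html\npage_002\thttps://example.com/b.html\n")
url_table = read_url_table(sample)
print(url_table["page_001"])
```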

1. Run `html2json.py` in the `html2json` folder to convert the HTML resources to JSON files.
```
cd html2json
python html2json.py --input_dir ../raw_data --output_dir ../json_data --url_file ../url.txt
python deduplication.py # Removes n-grams that are duplicated across the processed json files of all html files.
cd ..
```
`json_data` is the output directory containing the JSON files.

2. Run `index_pipeline.py` in the `indexer` folder to build a Faiss-supported index. This indexer is designed for materials in Chinese and English; if you want to index materials in other languages, please adjust `index_pipeline.py`.
```
cd indexer
python index_pipeline.py --index_type all --data_dir ../json_data --index_save_dir ../index --batch_size 128 --use_content_type all --train_dam_flag --language zh
cd ..
```
`index` is the Faiss-supported index directory. The `--use_content_type` argument indicates which parts of the documents (title, contents, all) are used to build the indexes. We suggest performing domain adaptation with the `--train_dam_flag` argument. If you choose not to, remove `--train_dam_flag` and change the `DAM_NAME` config in `./system/config.py`.

3. Prepare an LLM and its generation configuration JSON file in the `system` folder. Example JSON files for [YuLan](https://github.com/RUC-GSAI/YuLan-Chat), [ChatGLM](https://github.com/THUDM/ChatGLM-6B), and [Alpaca](https://github.com/tatsu-lab/stanford_alpaca) are provided in `system/llm_*.json`. The generation configuration mainly includes the model path, temperature, top_p, top_k, etc. You can even use different LLMs in different modules.
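
As a rough illustration of such a generation configuration, the sketch below builds and validates a JSON document with the fields named in step 3. The key names (`model_path`, `temperature`, `top_p`, `top_k`), the value ranges, and the `validate_generation_config` helper are assumptions for this example; check the provided `system/llm_*.json` files for the actual keys.

```python
import json

# Hypothetical key names; the real ones are defined by the files in system/llm_*.json.
REQUIRED_KEYS = {"model_path", "temperature", "top_p", "top_k"}


def validate_generation_config(text: str) -> dict:
    """Parse a generation config and check basic sampling-parameter ranges."""
    config = json.loads(text)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if config["temperature"] < 0:
        raise ValueError("temperature must be non-negative")
    if not 0.0 < config["top_p"] <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    if config["top_k"] < 1:
        raise ValueError("top_k must be at least 1")
    return config


# Placeholder values for illustration only.
example = json.dumps({"model_path": "/path/to/your/llm", "temperature": 0.7, "top_p": 0.9, "top_k": 40})
generation_config = validate_generation_config(example)
print(generation_config["model_path"])
```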

4. Run `web_demo.py` in the `system` folder to start serving.
```
cd system
streamlit run web_demo.py --server.port 1241
```
Then you can try your own RETA-LLM toolkit at `server.ip:1241`!

**The configuration of `web_demo.py` is in `config.py` in the `system` folder**. Please adjust the configuration if you use your own data.

For the LLMs, we provide the model loading and response templates for [YuLan](https://github.com/RUC-GSAI/YuLan-Chat), [ChatGLM](https://github.com/THUDM/ChatGLM-6B), [Alpaca](https://github.com/tatsu-lab/stanford_alpaca), and ChatGPT in `load_model.py` and `model_response.py` in the `system` folder. If you want to use other LLMs, please adjust these two files.

For the searchers, we define a template for your customized searcher; see the `Common_Searcher` class in `./system/searcher.py`.
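
To give a feel for what a customized searcher might look like, here is a toy keyword-overlap searcher. It is a sketch only: the `KeywordSearcher` class, its `search(query, top_k)` method, and the sample documents are hypothetical, and the actual `Common_Searcher` template may define different method names and signatures.

```python
from typing import Dict, List, Tuple


class KeywordSearcher:
    """Hypothetical custom searcher that ranks documents by word overlap with the query."""

    def __init__(self, documents: Dict[str, str]):
        self.documents = documents

    def search(self, query: str, top_k: int = 3) -> List[Tuple[str, int]]:
        query_terms = set(query.lower().split())
        scored = []
        for doc_id, text in self.documents.items():
            overlap = len(query_terms & set(text.lower().split()))
            if overlap:
                scored.append((doc_id, overlap))
        # Highest overlap first; ties are broken by document id for determinism.
        scored.sort(key=lambda item: (-item[1], item[0]))
        return scored[:top_k]


# Made-up documents for illustration.
searcher = KeywordSearcher({
    "admissions": "undergraduate admissions policy and enrollment plan",
    "dean": "the dean of the school of artificial intelligence",
})
results = searcher.search("who is the dean of the school")
print(results)
```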

## Case
![RETA-LLM case](./resource/case_zh.jpg)

The figure above shows a case of in-domain QA supported by RETA-LLM. In this case, we use the provided `sample_data` as external knowledge but perform more fine-grained HTML parsing.


## To-Do
RETA-LLM is still under development, and many issues remain to be solved. We sincerely welcome contributions to this open-source toolkit.

- [ ] Better fact checking module.
- [ ] Add parser for .doc / .pdf / .ppt resources.
- [ ] Add active retrieval augmentation.
- [ ] More modularized and configurable.
- [ ] ...



## Maintainers
<div>
<a href="https://github.com/rucliujn">@Jiongnan Liu</a>
<a href="https://github.com/ignorejjj">@Jiajie Jin</a>
</div>



## Acknowledgements
Thanks to Jingtao for the great implementation of [disentangled-retriever](https://github.com/jingtaozhan/disentangled-retriever).


## License
RETA-LLM is released under the [MIT License](https://github.com/RUC-GSAI/YuLan-IR/tree/main/RETA-LLM/LICENSE). All data and code in this project can only be used for academic purposes.