writinwaters committed
Commit 8b7269c · 1 Parent(s): 22fe41e

Updated RAGFlow UI (#3362)


### What problem does this PR solve?



### Type of change


- [x] Documentation Update

docker/README.md CHANGED
@@ -102,13 +102,19 @@ The [.env](./.env) file contains important environment variables for Docker.
  > - `RAGFLOW_IMAGE=swr.cn-north-4.myhuaweicloud.com/infiniflow/ragflow:dev` or,
  > - `RAGFLOW_IMAGE=registry.cn-hangzhou.aliyuncs.com/infiniflow/ragflow:dev`.
 
- ### Miscellaneous
+ ### Timezone
 
  - `TIMEZONE`
  The local time zone. Defaults to `'Asia/Shanghai'`.
+
+ ### Hugging Face mirror site
+
  - `HF_ENDPOINT`
  The mirror site for huggingface.co. It is disabled by default. You can uncomment this line if you have limited access to the primary Hugging Face domain.
+
+ ### MacOS
+
- - `MACOS`  
+ - `MACOS`
  Optimizations for MacOS. It is disabled by default. You can uncomment this line if your OS is MacOS.
 
  ## 🐋 Service configuration
docs/configurations.md CHANGED
@@ -123,13 +123,19 @@ If you cannot download the RAGFlow Docker image, try the following mirrors.
  - `RAGFLOW_IMAGE=registry.cn-hangzhou.aliyuncs.com/infiniflow/ragflow:dev`.
  :::
 
- ### Miscellaneous
+ ### Timezone
 
  - `TIMEZONE`
  The local time zone. Defaults to `'Asia/Shanghai'`.
+
+ ### Hugging Face mirror site
+
  - `HF_ENDPOINT`
  The mirror site for huggingface.co. It is disabled by default. You can uncomment this line if you have limited access to the primary Hugging Face domain.
+
+ ### MacOS
+
- - `MACOS`  
+ - `MACOS`
  Optimizations for MacOS. It is disabled by default. You can uncomment this line if your OS is MacOS.
 
  ## Service configuration
web/src/locales/en.ts CHANGED
@@ -200,43 +200,39 @@ export default {
  methodEmpty:
  'This will display a visual explanation of the knowledge base categories',
  book: `<p>Supported file formats are <b>DOCX</b>, <b>PDF</b>, <b>TXT</b>.</p><p>
- Since a book is long and not all the parts are useful, if it's a PDF,
- please setup the <i>page ranges</i> for every book in order eliminate negative effects and save computing time for analyzing.</p>`,
+ For each book in PDF, please set the <i>page ranges</i> to remove unwanted information and reduce analysis time.</p>`,
  laws: `<p>Supported file formats are <b>DOCX</b>, <b>PDF</b>, <b>TXT</b>.</p><p>
- Legal documents have a very rigorous writing format. We use text feature to detect split point.
+ Legal documents typically follow a rigorous writing format. We use text feature to identify split point.
  </p><p>
- The chunk granularity is consistent with 'ARTICLE', and all the upper level text will be included in the chunk.
+ The chunk has a granularity consistent with 'ARTICLE', ensuring all upper level text is included in the chunk.
  </p>`,
  manual: `<p>Only <b>PDF</b> is supported.</p><p>
  We assume that the manual has a hierarchical section structure, using the lowest section titles as basic unit for chunking documents. Therefore, figures and tables in the same section will not be separated, which may result in larger chunk sizes.
  </p>`,
  naive: `<p>Supported file formats are <b>DOCX, EXCEL, PPT, IMAGE, PDF, TXT, MD, JSON, EML, HTML</b>.</p>
- <p>This method apply the naive ways to chunk files: </p>
+ <p>This method chunks files using the 'naive' way: </p>
  <p>
- <li>Successive text will be sliced into pieces using vision detection model.</li>
- <li>Next, these successive pieces are merge into chunks whose token number is no more than 'Token number'.</li></p>`,
+ <li>Use vision detection model to split the texts into smaller segments.</li>
+ <li>Then, combine adjacent segments until the token count exceeds the threshold specified by 'Chunk token number', at which point a chunk is created.</li></p>`,
  paper: `<p>Only <b>PDF</b> file is supported.</p><p>
- If our model works well, the paper will be sliced by it's sections, like <i>abstract, 1.1, 1.2</i>, etc. </p><p>
- The benefit of doing this is that LLM can better summarize the content of relevant sections in the paper,
- resulting in more comprehensive answers that help readers better understand the paper.
- The downside is that it increases the context of the LLM conversation and adds computational cost,
- so during the conversation, you can consider reducing the ‘<b>topN</b>’ setting.</p>`,
- presentation: `<p>The supported file formats are <b>PDF</b>, <b>PPTX</b>.</p><p>
- Every page will be treated as a chunk. And the thumbnail of every page will be stored.</p><p>
- <i>All the PPT files you uploaded will be chunked by using this method automatically, setting-up for every PPT file is not necessary.</i></p>`,
+ Papers will be split by section, such as <i>abstract, 1.1, 1.2</i>. </p><p>
+ This approach enables the LLM to summarize the paper more effectively and provide more comprehensive, understandable responses.
+ However, it also increases the context for AI conversations and adds to the computational cost for the LLM. So during a conversation, consider reducing the value of ‘<b>topN</b>’.</p>`,
+ presentation: `<p>Supported file formats are <b>PDF</b>, <b>PPTX</b>.</p><p>
+ Every page in the slides is treated as a chunk, with its thumbnail image stored.</p><p>
+ <i>This chunk method is automatically applied to all uploaded PPT files, so you do not need to specify it manually.</i></p>`,
  qa: `
  <p>
  This chunk method supports <b>EXCEL</b> and <b>CSV/TXT</b> file formats.
  </p>
  <li>
- If the file is in <b>Excel</b> format, it should consist of two columns
+ If a file is in <b>Excel</b> format, it should contain two columns
  without headers: one for questions and the other for answers, with the
  question column preceding the answer column. Multiple sheets are
- acceptable as long as the columns are correctly structured.
+ acceptable, provided the columns are properly structured.
  </li>
  <li>
- If the file is in <b>CSV/TXT</b> format, it must be UTF-8 encoded with TAB
- used as the delimiter to separate questions and answers.
+ If a file is in <b>CSV/TXT</b> format, it must be UTF-8 encoded with TAB as the delimiter to separate questions and answers.
  </li>
  <p>
  <i>
@@ -245,25 +241,20 @@ export default {
  </i>
  </p>
  `,
- resume: `<p>The supported file formats are <b>DOCX</b>, <b>PDF</b>, <b>TXT</b>.
+ resume: `<p>Supported file formats are <b>DOCX</b>, <b>PDF</b>, <b>TXT</b>.
  </p><p>
- The résumé comes in a variety of formats, just like a person’s personality, but we often have to organize them into structured data that makes it easy to search.
- </p><p>
- Instead of chunking the résumé, we parse the résumé into structured data. As a HR, you can dump all the résumé you have,
- the you can list all the candidates that match the qualifications just by talk with <i>'RAGFlow'</i>.
+ Résumés of various forms are parsed and organized into structured data to facilitate candidate search for recruiters.
  </p>
  `,
- table: `<p><b>EXCEL</b> and <b>CSV/TXT</b> format files are supported.</p><p>
- Here're some tips:
+ table: `<p>Supported file formats are <b>EXCEL</b> and <b>CSV/TXT</b>.</p><p>
+ Here're some prerequisites and tips:
  <ul>
- <li>For csv or txt file, the delimiter between columns is <em><b>TAB</b></em>.</li>
- <li>The first line must be column headers.</li>
- <li>Column headers must be meaningful terms in order to make our LLM understanding.
- It's good to enumerate some synonyms using slash <i>'/'</i> to separate, and even better to
- enumerate values using brackets like <i>'gender/sex(male, female)'</i>.<p>
- Here are some examples for headers:<ol>
- <li>supplier/vendor<b>'TAB'</b>color(yellow, red, brown)<b>'TAB'</b>gender/sex(male, female)<b>'TAB'</b>size(M,L,XL,XXL)</li>
- <li>姓名/名字<b>'TAB'</b>电话/手机/微信<b>'TAB'</b>最高学历(高中,职高,硕士,本科,博士,初中,中技,中专,专科,专升本,MPA,MBA,EMBA)</li>
+ <li>For CSV or TXT file, the delimiter between columns must be <em><b>TAB</b></em>.</li>
+ <li>The first row must be column headers.</li>
+ <li>Column headers must be meaningful terms to aid your LLM's understanding.
+ It is good practice to juxtapose synonyms separated by a slash <i>'/'</i> and to enumerate values using brackets, for example: <i>'Gender/Sex (male, female)'</i>.<p>
+ Here are some examples of headers:<ol>
+ <li>supplier/vendor<b>'TAB'</b>Color (Yellow, Blue, Brown)<b>'TAB'</b>Sex/Gender (male, female)<b>'TAB'</b>size (M, L, XL, XXL)</li>
  </ol>
  </p>
  </li>
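
The header convention described in the updated `table` string (columns separated by TAB, synonyms separated by a slash, enumerated values in brackets) can be sketched as a small parser. This is only an illustration of the convention as the tooltip text states it: `ParsedHeader`, `parseHeaderCell`, and `parseHeaderRow` are hypothetical names that do not come from the RAGFlow codebase, and the way RAGFlow itself reads headers may differ.

```typescript
// Hypothetical shape for one parsed column header (not a RAGFlow type).
interface ParsedHeader {
  synonyms: string[]; // alternative names separated by '/'
  values: string[]; // enumerated values listed in brackets, if any
}

// Split on a separator, trim each piece, and drop empty pieces.
function splitTrim(text: string, separator: string): string[] {
  return text
    .split(separator)
    .map((piece) => piece.trim())
    .filter((piece) => piece.length > 0);
}

// Parse one header cell such as "Sex/Gender (male, female)".
function parseHeaderCell(cell: string): ParsedHeader {
  const open = cell.indexOf('(');
  const close = cell.lastIndexOf(')');
  // Only treat the brackets as a value list when both are present and ordered.
  const hasValues = open !== -1 && close > open;
  const namePart = hasValues ? cell.slice(0, open) : cell;
  const valuePart = hasValues ? cell.slice(open + 1, close) : '';
  return {
    synonyms: splitTrim(namePart, '/'),
    values: splitTrim(valuePart, ','),
  };
}

// Parse the first row of a TAB-delimited file into its column headers.
function parseHeaderRow(row: string): ParsedHeader[] {
  return row.split('\t').map(parseHeaderCell);
}

// The example header row from the tooltip text, with real TAB characters.
const exampleRow =
  'supplier/vendor\tColor (Yellow, Blue, Brown)\tSex/Gender (male, female)\tsize (M, L, XL, XXL)';

for (const header of parseHeaderRow(exampleRow)) {
  console.log(header.synonyms.join(' | '), '=>', header.values.join(', '));
}
```

A header without brackets, such as `supplier/vendor`, yields an empty `values` list here, which keeps the bracketed value list optional, as in the tooltip's own example.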