Metadata-Version: 2.1
Name: Auto-Research
Version: 1.0
Summary: Generate a scientific survey with just a query
Home-page: https://github.com/sidphbot/Auto-Research
Author: Sidharth Pal
Author-email: sidharth.pal1992@gmail.com
License: UNKNOWN
Project-URL: Docs, https://github.com/example/example/README.md
Project-URL: Bug Tracker, https://github.com/sidphbot/Auto-Research/issues
Project-URL: Demo, https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Environment :: Other Environment
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Other Audience
Classifier: Topic :: Education
Classifier: Topic :: Education :: Computer Aided Instruction (CAI)
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Classifier: Topic :: Scientific/Engineering :: Physics
Classifier: Natural Language :: English
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)
Classifier: License :: OSI Approved :: GNU Lesser General Public License v3 or later (LGPLv3+)
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Environment :: GPU
Classifier: Environment :: GPU :: NVIDIA CUDA
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.6
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: spacy
License-File: LICENSE
# Auto-Research
![Auto-Research][logo]
[logo]: https://github.com/sidphbot/Auto-Research/blob/main/logo.png
A no-code utility to generate a detailed, well-cited survey with topic-clustered sections (draft paper format) and other interesting artifacts from a single research query.
Data Provider: [arXiv](https://arxiv.org/) Open Archive Initiative OAI
Requirements:
- python 3.7 or above
- poppler-utils - `sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev`
- list of requirements in requirements.txt - `pip install -r requirements.txt`
- 8 GB disk space
- 13 GB CUDA (GPU) memory - for a survey of 100 searched papers (`max_search`) and 25 selected papers (`num_papers`)
#### Demo :
Video Demo : https://drive.google.com/file/d/1-77J2L10lsW-bFDOGdTaPzSr_utY743g/view?usp=sharing
Kaggle Re-usable Demo : https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query
(`[TIP]` click 'edit and run' to run the demo for your custom queries on a free GPU)
#### Steps to run (pip coming soon):
```
apt install -y poppler-utils libpoppler-cpp-dev
git clone https://github.com/sidphbot/Auto-Research.git
cd Auto-Research/
pip install -r requirements.txt
python survey.py [options] <your_research_query>
```
#### Artifacts generated (zipped):
- Detailed survey draft paper as txt file
- A curated list of top 25+ papers as pdfs and txts
- Images extracted from the above papers as jpegs, bmps, etc.
- Heading/section-wise highlights extracted from the above papers as a re-usable pure-python joblib dump
- Tables extracted from the papers (optional)
- Corpus of metadata highlights/text of the top 100 papers as a re-usable pure-python joblib dump
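The joblib dumps are plain Python objects and can be reloaded in any later script. A minimal round-trip sketch with toy data (the real dumps land under your `dump_dir`, e.g. `arxiv_dumps/`, and their exact filenames depend on the run):

```python
import joblib

# toy stand-in for the highlights corpus a survey run serializes
corpus = {'Introduction': ['highlight one', 'highlight two']}

joblib.dump(corpus, '/tmp/highlights.dump')     # what the run does for you
restored = joblib.load('/tmp/highlights.dump')  # what you do later
assert restored == corpus
```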
## Example run #1 - python utility
```
python survey.py 'multi-task representation learning'
```
## Example run #2 - python class
```
from survey import Surveyor
mysurveyor = Surveyor()
mysurveyor.survey('quantum entanglement')
```
### Research tools:
These are independent tools for your research or document text handling needs.
*[Tip]* : models can be changed in the defaults or passed in during init along with `refresh_models=True`
- `abstractive_summary` - takes a long text document (`string`) and returns a 1-paragraph abstract or “abstractive” summary (`string`)
Input:
`longtext` : string
Returns:
`summary` : string
- `extractive_summary` - takes a long text document (`string`) and returns a 1-paragraph "extractive" summary of extracted highlights (`string`)
Input:
`longtext` : string
Returns:
`summary` : string
- `generate_title` - takes a long text document (`string`) and returns a generated title (`string`)
Input:
`longtext` : string
Returns:
`title` : string
- `extractive_highlights` - takes a long text document (`string`) and returns a list of extracted highlights (`[string]`), a list of keywords (`[string]`) and key phrases (`[string]`)
Input:
`longtext` : string
Returns:
`highlights` : [string]
`keywords` : [string]
`keyphrases` : [string]
- `extract_images_from_file` - takes a pdf file name (`string`) and returns a list of image filenames (`[string]`).
Input:
`pdf_file` : string
Returns:
`images_files` : [string]
- `extract_tables_from_file` - takes a pdf file name (`string`) and returns a list of csv filenames (`[string]`).
Input:
`pdf_file` : string
Returns:
`csv_files` : [string]
- `cluster_lines` - takes a list of lines (`[string]`) and returns the topic-clustered sections (`dict(generated_title: [cluster_abstract])`) and clustered lines (`dict(cluster_id: [cluster_lines])`)
Input:
`lines` : [string]
Returns:
`sections` : dict(generated_title: [cluster_abstract])
`clusters` : dict(cluster_id: [cluster_lines])
- `extract_headings` - *[for scientific texts - Assumes an ‘abstract’ heading present]* takes a text file name (`string`) and returns a list of headings (`[string]`) and refined lines (`[string]`).
`[Tip 1]` : Use `extract_sections` as a wrapper (e.g. `extract_sections(extract_headings("/path/to/textfile"))`) to get heading-wise sectioned text with refined lines instead (`dict(heading: text)`)
`[Tip 2]` : write the word 'abstract' at the start of the file text to get an extraction for non-scientific texts as well!
Input:
`text_file` : string
Returns:
`refined` : [string]
`headings` : [string]
`sectioned_doc` : dict( heading: text) (Optional - Wrapper case)
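The exact call surface for these tools isn't shown in this README; assuming they hang off a `Surveyor` instance as the class-based example above suggests, a session might look like the following sketch (`paper.txt` is a hypothetical input file - check the source for exact signatures):

```python
from survey import Surveyor

surveyor = Surveyor()
longtext = open('paper.txt').read()

summary = surveyor.abstractive_summary(longtext)   # string
title = surveyor.generate_title(longtext)          # string
highlights, keywords, keyphrases = surveyor.extractive_highlights(longtext)

# topic-cluster the document's lines into titled sections
sections, clusters = surveyor.cluster_lines(longtext.splitlines())
```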
## Access/Modify defaults:
- inside code
```
from survey.Surveyor import DEFAULTS
from pprint import pprint
pprint(DEFAULTS)
```
or,
- Modify static config file - `defaults.py`
or,
- At runtime (utility)
```
python survey.py --help
```
```
usage: survey.py [-h] [--max_search max_metadata_papers]
[--num_papers max_num_papers] [--pdf_dir pdf_dir]
[--txt_dir txt_dir] [--img_dir img_dir] [--tab_dir tab_dir]
[--dump_dir dump_dir] [--models_dir save_models_dir]
[--title_model_name title_model_name]
[--ex_summ_model_name extractive_summ_model_name]
[--ledmodel_name ledmodel_name]
[--embedder_name sentence_embedder_name]
[--nlp_name spacy_model_name]
[--similarity_nlp_name similarity_nlp_name]
[--kw_model_name kw_model_name]
[--refresh_models refresh_models] [--high_gpu high_gpu]
query_string
Generate a survey just from a query !!
positional arguments:
query_string your research query/keywords
optional arguments:
-h, --help show this help message and exit
--max_search max_metadata_papers
maximum number of papers to gaze at - defaults to 100
--num_papers max_num_papers
maximum number of papers to download and analyse -
defaults to 25
--pdf_dir pdf_dir pdf paper storage directory - defaults to
arxiv_data/tarpdfs/
--txt_dir txt_dir text-converted paper storage directory - defaults to
arxiv_data/fulltext/
--img_dir img_dir image storage directory - defaults to
arxiv_data/images/
--tab_dir tab_dir tables storage directory - defaults to
arxiv_data/tables/
--dump_dir dump_dir all_output_dir - defaults to arxiv_dumps/
--models_dir save_models_dir
directory to save models (> 5GB) - defaults to
saved_models/
--title_model_name title_model_name
title model name/tag in hugging-face, defaults to
'Callidior/bert2bert-base-arxiv-titlegen'
--ex_summ_model_name extractive_summ_model_name
extractive summary model name/tag in hugging-face,
defaults to 'allenai/scibert_scivocab_uncased'
--ledmodel_name ledmodel_name
led model(for abstractive summary) name/tag in
hugging-face, defaults to 'allenai/led-
large-16384-arxiv'
--embedder_name sentence_embedder_name
sentence embedder name/tag in hugging-face, defaults
to 'paraphrase-MiniLM-L6-v2'
--nlp_name spacy_model_name
spacy model name/tag in hugging-face (if changed -
needs to be spacy-installed prior), defaults to
'en_core_sci_scibert'
--similarity_nlp_name similarity_nlp_name
spacy downstream model(for similarity) name/tag in
hugging-face (if changed - needs to be spacy-installed
prior), defaults to 'en_core_sci_lg'
--kw_model_name kw_model_name
keyword extraction model name/tag in hugging-face,
defaults to 'distilbert-base-nli-mean-tokens'
--refresh_models refresh_models
Refresh model downloads with given names (needs
at least one model name param above), defaults to False
--high_gpu high_gpu High GPU usage permitted, defaults to False
```
- At runtime (code)
> during surveyor object initialization with `surveyor_obj = Surveyor()`
- `pdf_dir`: String, pdf paper storage directory - defaults to `arxiv_data/tarpdfs/`
- `txt_dir`: String, text-converted paper storage directory - defaults to `arxiv_data/fulltext/`
- `img_dir`: String, image storage directory - defaults to `arxiv_data/images/`
- `tab_dir`: String, tables storage directory - defaults to `arxiv_data/tables/`
- `dump_dir`: String, all_output_dir - defaults to `arxiv_dumps/`
- `models_dir`: String, directory to save the huge models (> 5 GB), defaults to `saved_models/`
- `title_model_name`: String, title model name/tag in hugging-face, defaults to `Callidior/bert2bert-base-arxiv-titlegen`
- `ex_summ_model_name`: String, extractive summary model name/tag in hugging-face, defaults to `allenai/scibert_scivocab_uncased`
- `ledmodel_name`: String, led model(for abstractive summary) name/tag in hugging-face, defaults to `allenai/led-large-16384-arxiv`
- `embedder_name`: String, sentence embedder name/tag in hugging-face, defaults to `paraphrase-MiniLM-L6-v2`
- `nlp_name`: String, spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_scibert`
- `similarity_nlp_name`: String, spacy downstream trained model(for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_lg`
- `kw_model_name`: String, keyword extraction model name/tag in hugging-face, defaults to `distilbert-base-nli-mean-tokens`
- `high_gpu`: Bool, High GPU usage permitted, defaults to `False`
- `refresh_models`: Bool, refresh model downloads with the given names (needs at least one model name param above), defaults to `False`
> during survey generation with `surveyor_obj.survey(query="my_research_query")`
- `max_search`: int, maximum number of papers to gaze at - defaults to `100`
- `num_papers`: int, maximum number of papers to download and analyse - defaults to `25`
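Putting the two parameter sets together, a custom run from code might look like the following sketch (the directory names and query are illustrative):

```python
from survey import Surveyor

# init-time params: storage directories, model choices, GPU policy
surveyor = Surveyor(
    pdf_dir='my_run/pdfs/',
    dump_dir='my_run/dumps/',
    high_gpu=False,
)

# survey-time params: how many papers to search and keep
surveyor.survey('self-supervised representation learning',
                max_search=50, num_papers=10)
```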