Metadata-Version: 2.1
Name: Auto-Research
Version: 1.0
Summary: Generate scientific survey with just a query
Home-page: https://github.com/sidphbot/Auto-Research
Author: Sidharth Pal
Author-email: sidharth.pal1992@gmail.com
License: UNKNOWN
Project-URL: Docs, https://github.com/example/example/README.md
Project-URL: Bug Tracker, https://github.com/sidphbot/Auto-Research/issues
Project-URL: Demo, https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Environment :: Other Environment
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Other Audience
Classifier: Topic :: Education
Classifier: Topic :: Education :: Computer Aided Instruction (CAI)
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Classifier: Topic :: Scientific/Engineering :: Physics
Classifier: Natural Language :: English
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)
Classifier: License :: OSI Approved :: GNU Lesser General Public License v3 or later (LGPLv3+)
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Environment :: GPU
Classifier: Environment :: GPU :: NVIDIA CUDA
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.7
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: spacy
License-File: LICENSE
# Auto-Research

![Auto-Research][logo]

[logo]: https://github.com/sidphbot/Auto-Research/blob/main/logo.png
A no-code utility to generate a detailed, well-cited survey with topic-clustered sections (draft paper format) and other interesting artifacts from a single research query.

Data Provider: [arXiv](https://arxiv.org/) Open Archives Initiative (OAI)
Requirements:

- python 3.7 or above
- poppler-utils - `sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev`
- the packages listed in requirements.txt - `cat requirements.txt | xargs pip install`
- 8 GB disk space
- 13 GB CUDA (GPU) memory - for a survey of 100 searched papers (`max_search`) and 25 selected papers (`num_papers`)
#### Demo :

Video Demo : https://drive.google.com/file/d/1-77J2L10lsW-bFDOGdTaPzSr_utY743g/view?usp=sharing

Kaggle Re-usable Demo : https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query
(`[TIP]` click 'edit and run' to run the demo for your custom queries on a free GPU)
#### Steps to run (pip coming soon):

```
apt install -y poppler-utils libpoppler-cpp-dev
git clone https://github.com/sidphbot/Auto-Research.git
cd Auto-Research/
pip install -r requirements.txt
python survey.py [options] <your_research_query>
```
#### Artifacts generated (zipped):

- Detailed survey draft paper as a txt file
- A curated list of the top 25+ papers as pdfs and txts
- Images extracted from the above papers as jpegs, bmps, etc.
- Heading/section-wise highlights extracted from the above papers as a re-usable pure-python joblib dump
- Tables extracted from the papers (optional)
- Corpus of metadata highlights/text of the top 100 papers as a re-usable pure-python joblib dump (see the loading sketch below)
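The joblib dumps above are plain Python objects and can be reloaded for downstream use. A minimal loading sketch, assuming the default `arxiv_dumps/` output directory - the dump filename here is hypothetical, so use the actual name found inside your generated zip:

```
# Minimal sketch: reload a joblib dump produced by a survey run.
# 'arxiv_dumps/highlights.dump' is a hypothetical path for illustration.
import joblib

highlights = joblib.load('arxiv_dumps/highlights.dump')
print(type(highlights))  # inspect the stored structure before reusing it
```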
## Example run #1 - python utility

```
python survey.py 'multi-task representation learning'
```
## Example run #2 - python class

```
from survey import Surveyor

mysurveyor = Surveyor()
mysurveyor.survey('quantum entanglement')
```
### Research tools:

These are independent tools for your research or document text-handling needs; a combined usage sketch follows the list below.

`[Tip]` : models can be changed in defaults or passed in during init along with `refresh_models=True`
- `abstractive_summary` - takes a long text document (`string`) and returns a 1-paragraph abstract or "abstractive" summary (`string`)
    Input:
        `longtext` : string
    Returns:
        `summary` : string
- `extractive_summary` - takes a long text document (`string`) and returns a 1-paragraph summary of extracted highlights, i.e. an "extractive" summary (`string`)
    Input:
        `longtext` : string
    Returns:
        `summary` : string
- `generate_title` - takes a long text document (`string`) and returns a generated title (`string`)
    Input:
        `longtext` : string
    Returns:
        `title` : string
- `extractive_highlights` - takes a long text document (`string`) and returns a list of extracted highlights (`[string]`), a list of keywords (`[string]`) and key phrases (`[string]`)
    Input:
        `longtext` : string
    Returns:
        `highlights` : [string]
        `keywords` : [string]
        `keyphrases` : [string]
- `extract_images_from_file` - takes a pdf file name (`string`) and returns a list of image filenames (`[string]`).
    Input:
        `pdf_file` : string
    Returns:
        `images_files` : [string]
- `extract_tables_from_file` - takes a pdf file name (`string`) and returns a list of csv filenames (`[string]`).
    Input:
        `pdf_file` : string
    Returns:
        `csv_files` : [string]
- `cluster_lines` - takes a list of lines (`[string]`) and returns the topic-clustered sections (`dict(generated_title: [cluster_abstract])`) and clustered lines (`dict(cluster_id: [cluster_lines])`)
    Input:
        `lines` : [string]
    Returns:
        `sections` : dict(generated_title: [cluster_abstract])
        `clusters` : dict(cluster_id: [cluster_lines])
- `extract_headings` - *[for scientific texts - assumes an 'abstract' heading is present]* takes a text file name (`string`) and returns a list of headings (`[string]`) and refined lines (`[string]`).
    `[Tip 1]` : Use `extract_sections` as a wrapper (e.g. `extract_sections(extract_headings("/path/to/textfile"))`) to get heading-wise sectioned text with refined lines instead (`dict(heading: text)`)
    `[Tip 2]` : write the word 'abstract' at the start of the file text to get an extraction for non-scientific texts as well!
    Input:
        `text_file` : string
    Returns:
        `refined` : [string]
        `headings` : [string]
        `sectioned_doc` : dict(heading: text) (optional - wrapper case)
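A minimal combined sketch of a few of the tools above, assuming they are exposed as methods on the `Surveyor` class (as in Example run #2); the file paths are placeholders:

```
from survey import Surveyor

surveyor = Surveyor()

# Read a long text document (placeholder path)
longtext = open('paper.txt').read()

# Abstractive summary and generated title for the document
summary = surveyor.abstractive_summary(longtext)
title = surveyor.generate_title(longtext)

# Extracted highlights, keywords and key phrases
highlights, keywords, keyphrases = surveyor.extractive_highlights(longtext)

# Images extracted from a pdf (placeholder path)
image_files = surveyor.extract_images_from_file('paper.pdf')
```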
## Access/Modify defaults:

- inside code

```
from survey.Surveyor import DEFAULTS
from pprint import pprint

pprint(DEFAULTS)
```
or,

- Modify static config file - `defaults.py`

or,

- At runtime (utility)

```
python survey.py --help
```
```
usage: survey.py [-h] [--max_search max_metadata_papers]
                 [--num_papers max_num_papers] [--pdf_dir pdf_dir]
                 [--txt_dir txt_dir] [--img_dir img_dir] [--tab_dir tab_dir]
                 [--dump_dir dump_dir] [--models_dir save_models_dir]
                 [--title_model_name title_model_name]
                 [--ex_summ_model_name extractive_summ_model_name]
                 [--ledmodel_name ledmodel_name]
                 [--embedder_name sentence_embedder_name]
                 [--nlp_name spacy_model_name]
                 [--similarity_nlp_name similarity_nlp_name]
                 [--kw_model_name kw_model_name]
                 [--refresh_models refresh_models] [--high_gpu high_gpu]
                 query_string

Generate a survey just from a query !!

positional arguments:
  query_string          your research query/keywords

optional arguments:
  -h, --help            show this help message and exit
  --max_search max_metadata_papers
                        maximum number of papers to gaze at - defaults to 100
  --num_papers max_num_papers
                        maximum number of papers to download and analyse -
                        defaults to 25
  --pdf_dir pdf_dir     pdf paper storage directory - defaults to
                        arxiv_data/tarpdfs/
  --txt_dir txt_dir     text-converted paper storage directory - defaults to
                        arxiv_data/fulltext/
  --img_dir img_dir     image storage directory - defaults to
                        arxiv_data/images/
  --tab_dir tab_dir     tables storage directory - defaults to
                        arxiv_data/tables/
  --dump_dir dump_dir   all output dir - defaults to arxiv_dumps/
  --models_dir save_models_dir
                        directory to save models (> 5GB) - defaults to
                        saved_models/
  --title_model_name title_model_name
                        title model name/tag in hugging-face, defaults to
                        'Callidior/bert2bert-base-arxiv-titlegen'
  --ex_summ_model_name extractive_summ_model_name
                        extractive summary model name/tag in hugging-face,
                        defaults to 'allenai/scibert_scivocab_uncased'
  --ledmodel_name ledmodel_name
                        led model (for abstractive summary) name/tag in
                        hugging-face, defaults to
                        'allenai/led-large-16384-arxiv'
  --embedder_name sentence_embedder_name
                        sentence embedder name/tag in hugging-face, defaults
                        to 'paraphrase-MiniLM-L6-v2'
  --nlp_name spacy_model_name
                        spacy model name/tag in hugging-face (if changed -
                        needs to be spacy-installed prior), defaults to
                        'en_core_sci_scibert'
  --similarity_nlp_name similarity_nlp_name
                        spacy downstream model (for similarity) name/tag in
                        hugging-face (if changed - needs to be spacy-installed
                        prior), defaults to 'en_core_sci_lg'
  --kw_model_name kw_model_name
                        keyword extraction model name/tag in hugging-face,
                        defaults to 'distilbert-base-nli-mean-tokens'
  --refresh_models refresh_models
                        refresh model downloads with given names (needs at
                        least one model name param above), defaults to False
  --high_gpu high_gpu   High GPU usage permitted, defaults to False
```
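For example, a utility run that narrows the search and analysis set might look like this (the query and values are illustrative):

```
python survey.py --max_search 50 --num_papers 10 'graph neural networks'
```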
- At runtime (code)

> during surveyor object initialization with `surveyor_obj = Surveyor()`
- `pdf_dir`: String, pdf paper storage directory - defaults to `arxiv_data/tarpdfs/`
- `txt_dir`: String, text-converted paper storage directory - defaults to `arxiv_data/fulltext/`
- `img_dir`: String, image storage directory - defaults to `arxiv_data/images/`
- `tab_dir`: String, tables storage directory - defaults to `arxiv_data/tables/`
- `dump_dir`: String, all output dir - defaults to `arxiv_dumps/`
- `models_dir`: String, directory to save huge models - defaults to `saved_models/`
- `title_model_name`: String, title model name/tag in hugging-face, defaults to `Callidior/bert2bert-base-arxiv-titlegen`
- `ex_summ_model_name`: String, extractive summary model name/tag in hugging-face, defaults to `allenai/scibert_scivocab_uncased`
- `ledmodel_name`: String, led model (for abstractive summary) name/tag in hugging-face, defaults to `allenai/led-large-16384-arxiv`
- `embedder_name`: String, sentence embedder name/tag in hugging-face, defaults to `paraphrase-MiniLM-L6-v2`
- `nlp_name`: String, spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_scibert`
- `similarity_nlp_name`: String, spacy downstream trained model (for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_lg`
- `kw_model_name`: String, keyword extraction model name/tag in hugging-face, defaults to `distilbert-base-nli-mean-tokens`
- `high_gpu`: Bool, High GPU usage permitted, defaults to `False`
- `refresh_models`: Bool, refresh model downloads with given names (needs at least one model name param above), defaults to `False`
> during survey generation with `surveyor_obj.survey(query="my_research_query")`

- `max_search`: int, maximum number of papers to gaze at - defaults to `100`
- `num_papers`: int, maximum number of papers to download and analyse - defaults to `25`
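Putting both together, a minimal sketch of overriding defaults at init and at survey time, assuming `survey()` accepts the two limits above as keyword arguments (the chosen values are illustrative):

```
from survey import Surveyor

# Override a couple of defaults at initialization (illustrative values)
surveyor_obj = Surveyor(
    models_dir='saved_models/',
    high_gpu=True,
)

# Narrow the search and selection limits for this particular survey
surveyor_obj.survey(query='my_research_query', max_search=50, num_papers=10)
```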