---
title: NutriGenMe PaperExtractor
emoji: 📄
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 8501
---

# NutriGenMe Paper Extractor

## Overview
The NutriGenMe Paper Extractor extracts relevant information from genomic papers related to the NutriGenMe project. It uses natural language processing techniques to parse documents and pull out key data points, so that researchers and practitioners can efficiently gather insights from a large corpus of literature.

## Features
- **Automated Extraction**: Automatically extracts entities such as the title, authors, and conclusion of a study from academic papers.
- **Fast Extraction**: Extracts information from complex papers in under 10 minutes.
- **Table Extraction**: Extracts values from tables, focusing on gene names, SNPs, and associated diseases.
- **Export to Excel**: Exports extraction results to Excel format for easy integration and further analysis.
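
To make the table-extraction feature more concrete, here is a minimal sketch of one way to pick SNP identifiers out of already-parsed table rows. It is not the project's actual implementation: the function name `rows_with_snps`, the column names, and the sample rows are hypothetical, and it assumes SNPs appear as dbSNP-style rsIDs (`rs` followed by digits).

```python
import re

# dbSNP reference SNP identifiers look like "rs" followed by digits (e.g. rs9939609).
RSID_PATTERN = re.compile(r"\brs\d+\b", re.IGNORECASE)


def rows_with_snps(rows):
    """Keep rows that mention at least one rsID and attach the IDs found.

    `rows` is a list of dicts, one per table row; the column names are
    hypothetical and only serve this illustration.
    """
    results = []
    for row in rows:
        # Search every cell, since the SNP column may be named differently per paper.
        text = " ".join(str(value) for value in row.values())
        snps = sorted({match.lower() for match in RSID_PATTERN.findall(text)})
        if snps:
            results.append({**row, "snps": snps})
    return results


# Illustrative input: one data row and one repeated header row.
sample = [
    {"gene": "FTO", "variant": "rs9939609", "trait": "obesity"},
    {"gene": "Gene", "variant": "SNP", "trait": "Disease"},
]
print(rows_with_snps(sample))
```

Filtering on the identifier pattern rather than on a fixed column name is a design choice for this sketch: it tolerates tables whose headers differ from paper to paper, at the cost of missing SNPs that are not written as rsIDs.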

## Usage
1. Clone this repository:
```bash
git clone https://github.com/KalbeDigitalLab/nutrigenme-paper-extractor
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Prepare environment keys:
```dosini
# Credentials for LLM Models
OPENAI_API_KEY=<api_key>
GOOGLE_API_KEY=<api_key>
PERPLEXITY_API_KEY=<api_key>

# (Optional) Tracking your extraction process with LangSmith
LANGCHAIN_TRACING_V2='true'
LANGCHAIN_API_KEY=<langchain_api_key>
LANGCHAIN_ENDPOINT='https://api.smith.langchain.com'
LANGCHAIN_PROJECT=<project_name>
```

4. Run the application with `streamlit`:
```bash
streamlit run app.py
```
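
Before launching, it can help to confirm that the keys from step 3 are actually visible to the process. The following is a small sketch, not part of the repository: the helper name `missing_keys` is hypothetical, and treating all three LLM keys as required is an assumption (the application may only need a subset).

```python
import os

# Key names are taken from step 3; treating all three as required is an assumption.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY", "PERPLEXITY_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
print(missing_keys({"OPENAI_API_KEY": "placeholder"}))
```

Calling `missing_keys()` with no arguments checks the current process environment; an empty list means every listed key is set.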

The application is also deployed as a 🤗 Hugging Face [Space](https://huggingface.co/spaces/KalbeDigitalLab/nutrigenme-paper-extractor/).

## Documentation
**app.py**: Designs the user interface and guides the application flow, calling on other scripts for specific tasks.

**process.py**: Orchestrates the information extraction by delegating tasks to other scripts and handling the overall workflow.

**prompt.py**: Stores prompts crafted for Large Language Models (LLMs) to target specific information during extraction.

**table_detector.py**: Extracts information from tables recognized through Optical Character Recognition (OCR), using functions to detect and process them.
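
As an illustration of the role described for **prompt.py**, a prompt module could keep one template per target entity and fill in the paper text at call time. This is only a sketch under assumptions: the dictionary `PROMPTS`, the function `build_prompt`, and the prompt wording are hypothetical and are not taken from the repository.

```python
# Hypothetical prompt templates; the real contents of prompt.py may differ.
PROMPTS = {
    "title": "Return only the title of the following paper.\n\n{paper_text}",
    "authors": "List the authors of the following paper, separated by commas.\n\n{paper_text}",
    "conclusion": "Summarize the conclusion of the following study in one paragraph.\n\n{paper_text}",
}


def build_prompt(entity, paper_text):
    """Fill the template for one entity; raises KeyError for an unknown entity."""
    return PROMPTS[entity].format(paper_text=paper_text)


# Example: build the prompt that targets the paper's title.
print(build_prompt("title", "<paper text goes here>"))
```

Keeping the templates in one place separates prompt wording from the extraction workflow, so a prompt can be revised without touching the orchestration code.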

## Contributing
Contributions are welcome! If you'd like to contribute to this project, feel free to open a pull request.