title: NutriGenMe PaperExtractor
emoji: 📄
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 8501

NutriGenMe Paper Extractor

Overview

The NutriGenMe Paper Extractor is a tool for extracting relevant information from genomic papers related to the NutriGenMe project. It uses natural language processing techniques to parse documents and extract key data points, so that researchers and practitioners can efficiently gather insights from a large corpus of literature.

Features

  • Automated Extraction: Automatically extracts entities such as the title, authors, and study conclusion from academic papers.
  • Fast Extraction: Extracts information from complex papers in under 10 minutes.
  • Table Extraction: Extracts values from tables, focusing on gene names, SNPs, and associated diseases.
  • Export to Excel: Exports extraction results to Excel format for easy integration and further analysis.
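
As a rough illustration of the kind of values the table extraction targets, the sketch below pulls SNP identifiers out of table rows that have already been parsed into lists of cells. It assumes SNPs are written as dbSNP-style rsIDs (for example `rs1801133`). The function name `find_snp_ids` and the sample rows are hypothetical and are not taken from this repository, which may detect SNPs differently.

```python
import re

# Assumption: SNPs are written as dbSNP reference IDs ("rs" followed by digits).
RSID_PATTERN = re.compile(r"\brs\d+\b", re.IGNORECASE)


def find_snp_ids(rows):
    """Return the unique rsIDs found in a list of table rows, in first-seen order."""
    seen = []
    for row in rows:
        for cell in row:
            for match in RSID_PATTERN.findall(str(cell)):
                rsid = match.lower()  # normalize case so duplicates collapse
                if rsid not in seen:
                    seen.append(rsid)
    return seen


# Hypothetical rows, as they might come out of a parsed table.
sample_rows = [
    ["Gene", "SNP", "Trait"],
    ["MTHFR", "rs1801133", "Folate metabolism"],
    ["FTO", "RS9939609 (A allele)", "Obesity"],
]
print(find_snp_ids(sample_rows))
```

The word-boundary anchors keep the pattern from matching inside longer tokens, and lowercasing the matches means `RS9939609` and `rs9939609` count as the same SNP.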

Usage

  1. Clone this repository:
git clone https://github.com/KalbeDigitalLab/nutrigenme-paper-extractor
  2. Install dependencies:
pip install -r requirements.txt
  3. Prepare environment keys:
# Credentials for LLM Models
OPENAI_API_KEY=<api_key>
GOOGLE_API_KEY=<api_key>
PERPLEXITY_API_KEY=<api_key>

# (Optional) Tracking your extraction process with LangSmith
LANGCHAIN_TRACING_V2='true'
LANGCHAIN_API_KEY=<langchain_api_key>
LANGCHAIN_ENDPOINT='https://api.smith.langchain.com'
LANGCHAIN_PROJECT=<project_name>
  4. Run the application with Streamlit:
streamlit run app.py
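
Before launching the app, it can help to confirm that the credentials are actually visible to Python. The following is a minimal sketch, not part of this repository: it assumes all three LLM keys listed above are required (the project may need only a subset, depending on which models are used), and the helper name `missing_keys` is hypothetical.

```python
import os

# Key names copied from the environment keys listed above.
# Assumption: all three are required; adjust the list if only some models are used.
REQUIRED_KEYS = ["OPENAI_API_KEY", "GOOGLE_API_KEY", "PERPLEXITY_API_KEY"]


def missing_keys(env=None):
    """Return the required keys that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment keys: " + ", ".join(missing))
    else:
        print("All required LLM credentials are set.")
```

Passing a plain dictionary to `missing_keys` makes the check easy to try without touching the real environment.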

The application is also deployed as a 🤗 Hugging Face Space.

Documentation

app.py: Designs the user interface and guides the application flow, calling on other scripts for specific tasks.

process.py: Orchestrates the information extraction by delegating tasks to other scripts and handling the overall workflow.

prompt.py: Stores prompts crafted for Large Language Models (LLMs) to target specific information during extraction.

table_detector.py: Extracts information from tables recognized through Optical Character Recognition (OCR), with functions to detect and process them.

Contributing

Contributions are welcome! If you'd like to contribute to this project, feel free to create pull requests.