DataScribe: AI-Powered Information Extraction
DataScribe is an intelligent AI agent designed to streamline data retrieval, extraction, and structuring. By harnessing the power of Large Language Models (LLMs) and automated web search capabilities, it enables users to extract actionable insights from datasets with minimal effort. Designed for efficiency, scalability, and user-friendliness, DataScribe is ideal for professionals handling large datasets or requiring quick access to structured information.
🚀 Key Features
Core Functionalities
File Upload & Integration
- Upload datasets directly from CSV files.
- Google Sheets Integration: Seamlessly connect and interact with Google Sheets.
Custom Query Definition
- Define intuitive query templates for extracting data.
- Advanced Query Templates: Extract multiple fields simultaneously, e.g., "Find the email and address for {company}."
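As a rough illustration of how a query template such as the one above could be expanded, the sketch below fills the `{company}` placeholder once per dataset row. The helper names `placeholders` and `build_queries` are hypothetical and are not taken from the DataScribe code; the sample rows are made up.

```python
from string import Formatter


def placeholders(template: str) -> list[str]:
    """Return the placeholder names used in a template, in order of appearance."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


def build_queries(template: str, rows: list[dict[str, str]]) -> list[str]:
    """Fill the template once per row (hypothetical helper, not DataScribe's API)."""
    needed = placeholders(template)
    queries = []
    for row in rows:
        # Fail early if the dataset lacks a column the template refers to.
        missing = [name for name in needed if name not in row]
        if missing:
            raise KeyError(f"row is missing column(s): {missing}")
        queries.append(template.format(**{name: row[name] for name in needed}))
    return queries


template = "Find the email and address for {company}."
rows = [{"company": "Acme Corp"}, {"company": "Globex"}]  # illustrative data
queries = build_queries(template, rows)
for query in queries:
    print(query)
```

Checking the placeholders against the row before formatting means a template that names a missing column is reported clearly instead of failing partway through a batch.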
Automated Information Retrieval
- LLM-Powered Extraction: Uses ChatGroq for LLM processing and Serper API for web searches.
- Retry Mechanism: Handles failed queries with robust retries for accurate results.
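The README does not describe how the retry mechanism is implemented. One common approach, shown here purely as an assumption, is to retry a failed call a fixed number of times with exponential backoff. The function `with_retries` and the simulated `flaky_search` are hypothetical names used only for this sketch.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    task: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying on any exception with exponentially growing delays."""
    last_error = None
    for attempt in range(attempts):
        try:
            return task()
        except Exception as error:  # a real app would catch narrower error types
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"query failed after {attempts} attempts") from last_error


# Simulated search call that fails twice before succeeding.
calls = {"count": 0}


def flaky_search() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "contact@example.com"


delays = []  # record the delays instead of actually sleeping
result = with_retries(flaky_search, attempts=3, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

Passing the `sleep` function as a parameter keeps the sketch fast to run and easy to test; in normal use the default `time.sleep` applies.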
Interactive Results Dashboard
- View extracted data in a clean, dynamic, and filterable table view.
Export & Update Options
- Download results as CSV or directly update Google Sheets.
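For the CSV export, a minimal sketch using only Python's standard `csv` module is shown below. The project lists pandas in its stack, so the actual code may well rely on `DataFrame.to_csv` instead; the helper `results_to_csv` and the sample rows are assumptions made for illustration.

```python
import csv
import io


def results_to_csv(rows: list[dict[str, str]], fieldnames: list[str]) -> str:
    """Serialize extracted rows to CSV text (hypothetical helper)."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative extraction results, not real data.
results = [
    {"company": "Acme Corp", "email": "info@acme.example"},
    {"company": "Globex", "email": "hello@globex.example"},
]
csv_text = results_to_csv(results, ["company", "email"])
print(csv_text)
```

The returned string could then be handed to whatever download control the dashboard provides.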
🛠️ Technology Stack
Component | Technologies |
---|---|
Dashboard/UI | Streamlit |
Data Handling | pandas, Google Sheets API (OAuth 2.0, gspread) |
Search API | Serper API, ScraperAPI |
LLM API | Groq API |
Backend | Python |
Agents | LangChain |
📂 Repository Structure
DataScribe/
├── app.py # Main application entry point
├── funcs/ # Core functionalities
│ ├── googlesheet.py # Google Sheets integration
│ ├── llm.py # LLM-based extraction and search
├── views/ # UI components and layout
│ ├── home.py # Home page and navigation
│ ├── upload_data.py # File upload and data preprocessing
│ ├── define_query.py # Query definition logic
│ ├── extract_information.py # Information extraction workflows
│ ├── view_and_download.py # Result viewing and export functionalities
├── requirements.txt # Dependency list
├── .env.sample # Environment variable template
├── credentialsample.json # Google API credentials template
├── README.md # Documentation
├── LICENSE # License information
📖 Setup Instructions
Prerequisites
- Python 3.9 or higher.
- Google API credentials for Sheets integration.
Installation Steps
Clone the Repository
git clone https://github.com/sam22ridhi/DataScribe.git
cd DataScribe
Install Dependencies
pip install -r requirements.txt
Set Up Environment Variables
- Copy the .env.sample file to .env:
  cp .env.sample .env
- Add the required API keys to the .env file:
  GOOGLE_API_KEY=<your_google_api_key>
  SERPER_API_KEY=<your_serper_api_key>
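Before launching the app, it can help to confirm that the keys are actually visible to Python. The sketch below assumes the variables have already been loaded into the process environment (for example by a dotenv loader, which this README does not confirm the project uses); `missing_keys` is a hypothetical helper.

```python
import os
from collections.abc import Mapping

# Key names taken from the .env instructions above.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "SERPER_API_KEY")


def missing_keys(env: Mapping[str, str]) -> list[str]:
    """Return the required keys that are absent or empty in env."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


missing = missing_keys(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required API keys are set.")
```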
Prepare Google API Credentials
- Replace the content in credentialsample.json with your Google API credentials and save it as credentials.json.
Run the Application
streamlit run app.py
Access the Application
Open http://localhost:8501 in your browser.
🛠️ Usage Guide
Upload Data
Navigate to the Upload Data tab to import a CSV file or connect to Google Sheets.
Define Query
Use the Define Query tab to specify search templates. Select the column containing the entities and define the fields to extract.
Extract Information
Run automated searches in the Extract Information tab to fetch structured data.
View & Download
Review the results in the View & Download tab, then export them as CSV or update Google Sheets directly.
🌟 Screenshots
Home Page
File Upload
Define Query
Extracted Data
Running the Application
View & Download Results
📝 Loom Video Walkthrough
Watch the 2-minute walkthrough showcasing:
- Overview of DataScribe's purpose and features.
- Key workflows, including upload, extraction, and export.
- Code features
📝 Hugging Face Tryout
Try DataScribe on Hugging Face.
🙌 Acknowledgements
Special thanks to Breakout AI and Kapil Mittal for the opportunity to demonstrate my skills through this project/assessment.
📜 License
This project is licensed under the Apache License 2.0.
🤝 Contributing
We welcome contributions!
- Fork the repository.
- Create a feature branch.
- Submit a pull request with a detailed description of changes.
📬 Contact
For feedback or support: