---
title: LoomRAG
emoji: π
colorFrom: indigo
colorTo: pink
sdk: streamlit
sdk_version: 1.41.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Multimodal RAG that "weaves" together text and images
---
# LoomRAG: Multimodal Retrieval-Augmented Generation for AI-Powered Search
This project implements LoomRAG, a Multimodal Retrieval-Augmented Generation (RAG) system that leverages OpenAI's CLIP model for neural cross-modal retrieval and semantic search. Users can submit text queries and retrieve both text and image results through a shared vector-embedding space. The system also provides an annotation interface for creating custom datasets, supports CLIP fine-tuning with configurable parameters for domain-specific applications, and accepts image and PDF uploads for intelligent retrieval, all through a Streamlit-based interface.
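At its core, the system relies on CLIP mapping text and images into a single embedding space, so a text query can be scored directly against image embeddings. The snippet below is a minimal sketch of that idea using the Hugging Face `transformers` CLIP implementation; the model name, file path, and normalization details are illustrative rather than the project's exact configuration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Encode a text query and an image into the same latent space.
image = Image.open("example.jpg")  # hypothetical file
inputs = processor(text=["sunset over mountains"], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# L2-normalize so the dot product equals cosine similarity.
text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
print(f"text-image similarity: {(text_emb @ image_emb.T).item():.3f}")
```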
Experience the project in action:

## Implementation Screenshots
## Features
- Cross-Modal Retrieval: Search with text queries to retrieve both text and image results using deep learning
- Streamlit Interface: Provides a user-friendly web interface for interacting with the system
- Upload Options: Allows users to upload images and PDFs for AI-powered processing and retrieval
- Embedding-Based Search: Uses OpenAI's CLIP model to align text and image embeddings in a shared latent space
- Augmented Text Generation: Enhances text results using LLMs for contextually rich outputs
- Image Annotation: Enables users to annotate uploaded images through an intuitive interface
- CLIP Fine-Tuning: Supports custom model training with configurable parameters, including test split size, learning rate, optimizer, and weight decay
- Fine-Tuned Model Integration: Seamlessly loads and uses fine-tuned CLIP models for improved search and retrieval
## Architecture Overview
Data Indexing:
- Text, images, and PDFs are preprocessed and embedded using the CLIP model
- Embeddings are stored in a vector database for fast and efficient retrieval
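A rough sketch of this indexing step, assuming a flat inner-product FAISS index over L2-normalized CLIP image embeddings; the actual index type, metadata handling, and persistence format in the repository may differ.

```python
import faiss
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    """Encode images with CLIP and L2-normalize so inner product = cosine similarity."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return emb.numpy().astype("float32")

image_paths = ["uploads/dog.jpg", "uploads/sunset.jpg"]  # hypothetical files
embeddings = embed_images(image_paths)

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product search
index.add(embeddings)
faiss.write_index(index, "images.index")        # persist alongside the path list
```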
Query Processing:
- Text queries are converted into embeddings for semantic search
- Uploaded images and PDFs are processed and embedded for comparison
- The system performs a nearest neighbor search in the vector database to retrieve relevant text and images
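Continuing the indexing sketch above, query processing embeds the text with CLIP's text encoder and runs a nearest neighbor search over the stored vectors; `index`, `image_paths`, `model`, and `processor` are the objects defined in that sketch.

```python
def embed_text(query):
    """Encode a text query into the same space as the indexed images."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return emb.numpy().astype("float32")

# Top-3 nearest neighbors for a natural-language query.
scores, ids = index.search(embed_text("sunset over mountains"), 3)
for score, idx in zip(scores[0], ids[0]):
    print(f"{image_paths[idx]}  (similarity={score:.3f})")
```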
Response Generation:
- For text results: Optionally refined or augmented using a language model
- For image results: Directly returned or enhanced with image captions
- For PDFs: Extracts text content and provides relevant sections
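For text hits, the retrieved passages can be passed to an LLM together with the query to produce a grounded answer. The snippet below is a generic sketch using the OpenAI chat completions client; the actual model, prompt, and client used by the app are not specified here and may differ.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def augment_answer(query, retrieved_passages):
    """Ask an LLM to answer the query using only the retrieved context."""
    context = "\n\n".join(retrieved_passages)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```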
Image Annotation:
- Dedicated annotation page for managing uploaded images
- Support for creating and managing multiple datasets simultaneously
- Flexible annotation workflow for efficient data labeling
- Dataset organization and management capabilities
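The on-disk layout of annotated datasets is an implementation detail of the app; for the fine-tuning sketch below, it is enough to assume that annotation produces simple image-caption pairs, for example:

```python
import csv

# Hypothetical annotation records produced by the annotation page:
# one caption per uploaded image, grouped into a named dataset.
annotations = [
    {"image_path": "uploads/sunset_001.jpg", "caption": "sunset over mountains"},
    {"image_path": "uploads/dog_park_002.jpg", "caption": "a dog playing in a park"},
]

with open("datasets/my_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["image_path", "caption"])
    writer.writeheader()
    writer.writerows(annotations)
```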
Model Fine-Tuning:
- Custom CLIP model training on annotated images
- Configurable training parameters for optimization
- Integration of fine-tuned models into the search pipeline
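A condensed sketch of what fine-tuning on the annotated image-caption pairs can look like with the Hugging Face CLIP implementation. The test split size, learning rate, optimizer, and weight decay mirror the parameters exposed in the UI, but the values and the training loop itself are illustrative rather than the repository's exact code.

```python
import torch
from PIL import Image
from sklearn.model_selection import train_test_split
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# `annotations` is assumed to be the list of image-caption pairs from the annotation step.
train_set, test_set = train_test_split(annotations, test_size=0.2)   # configurable split
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

def collate(batch):
    images = [Image.open(r["image_path"]).convert("RGB") for r in batch]
    texts = [r["caption"] for r in batch]
    return processor(text=texts, images=images, return_tensors="pt", padding=True)

model.train()
batch_size = 8
for epoch in range(3):
    for i in range(0, len(train_set), batch_size):
        inputs = collate(train_set[i:i + batch_size])
        outputs = model(**inputs, return_loss=True)  # symmetric contrastive loss over the batch
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("finetuned-clip")  # reload later for retrieval
```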
## Installation
Clone the repository:
```bash
git clone https://github.com/NotShrirang/LoomRAG.git
cd LoomRAG
```
Create a virtual environment and install dependencies:
```bash
pip install -r requirements.txt
```
## Usage
Running the Streamlit Interface:
Start the Streamlit app:
```bash
streamlit run app.py
```
Access the interface in your browser to:
- Submit natural language queries
- Upload images or PDFs to retrieve contextually relevant results
- Annotate uploaded images
- Fine-tune CLIP models with custom parameters
- Use fine-tuned models for improved search results
Example Queries:
- Text Query: "sunset over mountains"
  Output: An image of a sunset over mountains along with descriptive text
- PDF Upload: Upload a PDF of a scientific paper
  Output: Extracted key sections or contextually relevant images
## Configuration
- Vector Database: Uses FAISS for efficient similarity search
- Model: Uses OpenAI's CLIP for neural embedding generation
- Augmentation: Optional LLM-based augmentation of text responses
- Fine-Tuning: Configurable parameters for model training and optimization
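As a rough illustration of the last point, swapping a fine-tuned checkpoint into the retrieval pipeline amounts to loading it in place of the base weights and re-embedding the corpus; the directory name below is the hypothetical output of the fine-tuning sketch, not the app's actual configuration.

```python
from transformers import CLIPModel, CLIPProcessor

# Load fine-tuned weights instead of the base OpenAI checkpoint
# ("finetuned-clip" is the hypothetical output directory from the training sketch).
model = CLIPModel.from_pretrained("finetuned-clip").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Rebuild the FAISS index with embeddings from the fine-tuned model so that
# search results reflect the domain-specific training (reusing the indexing
# helpers sketched in the Architecture Overview).
```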
## Roadmap
- Fine-tuning CLIP for domain-specific datasets
- Adding support for audio and video modalities
- Improving the re-ranking system for better contextual relevance
- Enhanced PDF parsing with semantic section segmentation
## Contributing
Contributions are welcome! Please open an issue or submit a pull request for any feature requests or bug fixes.
## License
This project is licensed under the Apache-2.0 License. See the LICENSE file for details.