
geekyrakshit committed on Oct 20, 2024

Commit bf14736 (unverified) · 2 Parent(s): 694a076 ff75fe0

Merge pull request #14 from soumik12345/feat/ensemble-of-image-loaders


Files changed (22)

docs/document_loader/image_loader/base_img_loader.md +3 -0
docs/document_loader/image_loader/fitzpil_img_loader.md +22 -0
docs/document_loader/image_loader/marker_img_loader.md +21 -0
docs/document_loader/image_loader/pdf2image_img_loader.md +26 -0
docs/document_loader/image_loader/pdfplumber_img_loader.md +22 -0
docs/document_loader/image_loader/pymupdf_img_loader.md +23 -0
docs/document_loader/load_image.md +0 -3
medrag_multi_modal/document_loader/__init__.py +12 -4
medrag_multi_modal/document_loader/image_loader/__init__.py +13 -0
medrag_multi_modal/document_loader/image_loader/base_img_loader.py +113 -0
medrag_multi_modal/document_loader/image_loader/fitzpil_img_loader.py +127 -0
medrag_multi_modal/document_loader/image_loader/marker_img_loader.py +100 -0
medrag_multi_modal/document_loader/image_loader/pdf2image_img_loader.py +92 -0
medrag_multi_modal/document_loader/image_loader/pdfplumber_img_loader.py +101 -0
medrag_multi_modal/document_loader/image_loader/pymupdf_img_loader.py +124 -0
medrag_multi_modal/document_loader/load_image.py +0 -131
medrag_multi_modal/document_loader/text_loader/marker_text_loader.py +10 -9
medrag_multi_modal/document_loader/text_loader/pdfplumber_text_loader.py +9 -8
medrag_multi_modal/document_loader/text_loader/pymupdf4llm_text_loader.py +9 -8
medrag_multi_modal/document_loader/text_loader/pypdf2_text_loader.py +9 -8
medrag_multi_modal/retrieval/multi_modal_retrieval.py +2 -1
mkdocs.yml +7 -1

docs/document_loader/image_loader/base_img_loader.md ADDED

	@@ -0,0 +1,3 @@


+## Load images from PDF files
+
+::: medrag_multi_modal.document_loader.image_loader.base_img_loader

docs/document_loader/image_loader/fitzpil_img_loader.md ADDED

	@@ -0,0 +1,22 @@

+# Load images from PDF files (using Fitz & PIL)
+??? note "Note"
+    **Underlying Library:** `fitz` & `pillow`
+    Extract images from PDF files using `fitz` and `pillow`.
+    Use it in our library with:
+    ```python
+    from medrag_multi_modal.document_loader.image_loader import FitzPILImageLoader
+    ```
+    For more details, please refer to the sources below.
+    **Sources:**
+    - [Docs](https://pymupdf.readthedocs.io/en/latest/intro.html)
+    - [GitHub](https://github.com/pymupdf/PyMuPDF)
+    - [PyPI](https://pypi.org/project/PyMuPDF/)
+    - [PyPI](https://pypi.org/project/pillow/)
+::: medrag_multi_modal.document_loader.image_loader.fitzpil_img_loader

docs/document_loader/image_loader/marker_img_loader.md ADDED

	@@ -0,0 +1,21 @@

+# Load images from PDF files (using Marker)
+??? note "Note"
+    **Underlying Library:** `marker-pdf`
+    Extract images from PDF files using `marker-pdf`.
+    Use it in our library with:
+    ```python
+    from medrag_multi_modal.document_loader.image_loader import MarkerImageLoader
+    ```
+    For details, please refer to the sources below.
+    **Sources:**
+    - [DataLab](https://www.datalab.to)
+    - [GitHub](https://github.com/VikParuchuri/marker)
+    - [PyPI](https://pypi.org/project/marker-pdf/)
+::: medrag_multi_modal.document_loader.image_loader.marker_img_loader

docs/document_loader/image_loader/pdf2image_img_loader.md ADDED

	@@ -0,0 +1,26 @@

+# Load images from PDF files (using PDF2Image)
+!!! danger "Warning"
+    Unlike other image extraction methods in `document_loader.image_loader`, this loader does not extract embedded images from the PDF.
+    Instead, it creates a snapshot image version of each selected page from the PDF.
+??? note "Note"
+    **Underlying Library:** `pdf2image`
+    Extract images from PDF files using `pdf2image`.
+    Use it in our library with:
+    ```python
+    from medrag_multi_modal.document_loader.image_loader import PDF2ImageLoader
+    ```
+    For details and available `**kwargs`, please refer to the sources below.
+    **Sources:**
+    - [GitHub](https://github.com/Belval/pdf2image)
+    - [PyPI](https://pypi.org/project/pdf2image/)
+::: medrag_multi_modal.document_loader.image_loader.pdf2image_img_loader

docs/document_loader/image_loader/pdfplumber_img_loader.md ADDED

	@@ -0,0 +1,22 @@

+# Load images from PDF files (using PDFPlumber)
+??? note "Note"
+    **Underlying Library:** `pdfplumber`
+    Extract images from PDF files using `pdfplumber`.
+    You can interact with the underlying library and fine-tune the outputs via `**kwargs`.
+    Use it in our library with:
+    ```python
+    from medrag_multi_modal.document_loader.image_loader import PDFPlumberImageLoader
+    ```
+    For details, please refer to the sources below.
+    **Sources:**
+    - [GitHub](https://github.com/jsvine/pdfplumber)
+    - [PyPI](https://pypi.org/project/pdfplumber/)
+::: medrag_multi_modal.document_loader.image_loader.pdfplumber_img_loader

docs/document_loader/image_loader/pymupdf_img_loader.md ADDED

	@@ -0,0 +1,23 @@

+# Load images from PDF files (using PyMuPDF)
+??? note "Note"
+    **Underlying Library:** `pymupdf`
+    PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
+    You can interact with the underlying library and fine-tune the outputs via `**kwargs`.
+    Use it in our library with:
+    ```python
+    from medrag_multi_modal.document_loader.image_loader import PyMuPDFImageLoader
+    ```
+    For details, please refer to the sources below.
+    **Sources:**
+    - [Docs](https://pymupdf.readthedocs.io/en/latest/)
+    - [GitHub](https://github.com/pymupdf/PyMuPDF)
+    - [PyPI](https://pypi.org/project/PyMuPDF/)
+::: medrag_multi_modal.document_loader.image_loader.pymupdf_img_loader

docs/document_loader/load_image.md DELETED

@@ -1,3 +0,0 @@
-# Load PDF pages as images
-::: medrag_multi_modal.document_loader.load_image

medrag_multi_modal/document_loader/__init__.py CHANGED

@@ -1,5 +1,10 @@
-from .load_image import ImageLoader
-from .load_text_image import TextImageLoader
+from .image_loader import (
+    FitzPILImageLoader,
+    MarkerImageLoader,
+    PDF2ImageLoader,
+    PDFPlumberImageLoader,
+    PyMuPDFImageLoader,
+)
 from .text_loader import (
     MarkerTextLoader,
     PDFPlumberTextLoader,
@@ -12,6 +17,9 @@ __all__ = [
     "PyPDF2TextLoader",
     "PDFPlumberTextLoader",
     "MarkerTextLoader",
-    "ImageLoader",
-    "TextImageLoader",
+    "PDF2ImageLoader",
+    "MarkerImageLoader",
+    "PDFPlumberImageLoader",
+    "PyMuPDFImageLoader",
+    "FitzPILImageLoader",
 ]

medrag_multi_modal/document_loader/image_loader/__init__.py ADDED

	@@ -0,0 +1,13 @@

+from .fitzpil_img_loader import FitzPILImageLoader
+from .marker_img_loader import MarkerImageLoader
+from .pdf2image_img_loader import PDF2ImageLoader
+from .pdfplumber_img_loader import PDFPlumberImageLoader
+from .pymupdf_img_loader import PyMuPDFImageLoader
+__all__ = [
+    "PDF2ImageLoader",
+    "MarkerImageLoader",
+    "PDFPlumberImageLoader",
+    "PyMuPDFImageLoader",
+    "FitzPILImageLoader",
+]

medrag_multi_modal/document_loader/image_loader/base_img_loader.py ADDED

	@@ -0,0 +1,113 @@

+import asyncio
+import os
+from abc import abstractmethod
+from typing import Dict, List, Optional
+import rich
+import wandb
+from medrag_multi_modal.document_loader.text_loader.base_text_loader import (
+    BaseTextLoader,
+)
+class BaseImageLoader(BaseTextLoader):
+    def __init__(self, url: str, document_name: str, document_file_path: str):
+        super().__init__(url, document_name, document_file_path)
+    @abstractmethod
+    async def extract_page_data(
+        self, page_idx: int, image_save_dir: str, **kwargs
+    ) -> Dict[str, str]:
+        """
+        Abstract method to process a single page of the PDF and extract the image data.
+        Overwrite this method in the subclass to provide the actual implementation and
+        processing logic for each page of the PDF using various PDF processing libraries.
+        Args:
+            page_idx (int): The index of the page to process.
+            image_save_dir (str): The directory to save the extracted images.
+            **kwargs: Additional keyword arguments that may be used by underlying libraries.
+        Returns:
+            Dict[str, str]: A dictionary containing the processed page data.
+        """
+        pass
+    async def load_data(
+        self,
+        start_page: Optional[int] = None,
+        end_page: Optional[int] = None,
+        wandb_artifact_name: Optional[str] = None,
+        image_save_dir: str = "./images",
+        cleanup: bool = True,
+        **kwargs,
+    ) -> List[Dict[str, str]]:
+        """
+        Asynchronously loads images from a PDF file specified by a URL or local file path.
+        The overridden `extract_page_data` method processes each page, and the extracted
+        images are optionally published to a WandB artifact.
+        This function downloads a PDF from a given URL if it does not already exist locally,
+        reads the specified range of pages, scans each page's content to extract images, and
+        returns a list of dictionaries containing the image paths and metadata.
+        It uses `PyPDF2` to calculate the number of pages in the PDF and the
+        overridden `extract_page_data` method provides the actual implementation to process
+        each page, extract the image content from the PDF, and convert it to png format.
+        It processes pages concurrently using `asyncio` for efficiency.
+        If a wandb_artifact_name is provided, the processed pages are published to a WandB artifact.
+        Args:
+            start_page (Optional[int]): The starting page index (0-based) to process. Defaults to the first page.
+            end_page (Optional[int]): The ending page index (0-based) to process. Defaults to the last page.
+            wandb_artifact_name (Optional[str]): The name of the WandB artifact to publish the pages to, if provided.
+            image_save_dir (str): The directory to save the extracted images.
+            cleanup (bool): Whether to remove extracted images from `image_save_dir`, if uploading to wandb artifact.
+            **kwargs: Additional keyword arguments that will be passed to extract_page_data method and the underlying library.
+        Returns:
+            List[Dict[str, Any]]: A list of dictionaries, each containing the image and metadata for a processed page.
+            Each dictionary will have the following keys and values:
+            - "page_idx": (int) the index of the page.
+            - "document_name": (str) the name of the document.
+            - "file_path": (str) the local file path where the PDF is stored.
+            - "file_url": (str) the URL of the PDF file.
+            - "image_file_path" or "image_file_paths": (str) the local file path where the image/images are stored.
+        Raises:
+            ValueError: If the specified start_page or end_page is out of bounds of the document's page count.
+        """
+        os.makedirs(image_save_dir, exist_ok=True)
+        start_page, end_page = self.get_page_indices(start_page, end_page)
+        pages = []
+        processed_pages_counter: int = 1
+        total_pages = end_page - start_page
+        async def process_page(page_idx):
+            nonlocal processed_pages_counter
+            page_data = await self.extract_page_data(page_idx, image_save_dir, **kwargs)
+            pages.append(page_data)
+            rich.print(
+                f"Processed page idx: {page_idx}, progress: {processed_pages_counter}/{total_pages}"
+            )
+            processed_pages_counter += 1
+        tasks = [process_page(page_idx) for page_idx in range(start_page, end_page)]
+        for task in asyncio.as_completed(tasks):
+            await task
+        if wandb_artifact_name:
+            artifact = wandb.Artifact(name=wandb_artifact_name, type="dataset")
+            artifact.add_dir(local_path=image_save_dir)
+            artifact.save()
+            rich.print("Artifact saved and uploaded to wandb!")
+        if cleanup:
+            for file in os.listdir(image_save_dir):
+                file_path = os.path.join(image_save_dir, file)
+                if os.path.isfile(file_path):
+                    os.remove(file_path)
+        return pages
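
The concurrency pattern in `load_data` above can be tried in isolation. The following is a minimal sketch, not the library's code: `SketchPageLoader` and `FakeLoader` are hypothetical stand-ins for `BaseImageLoader` and a concrete loader, and the final sort by `page_idx` is an addition for determinism that the diff itself does not perform.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class SketchPageLoader(ABC):
    """Hypothetical stand-in for BaseImageLoader; not the library's class."""

    @abstractmethod
    async def extract_page_data(self, page_idx: int) -> Dict[str, Any]:
        """Process one page and return its metadata."""

    async def load_data(self, start_page: int, end_page: int) -> List[Dict[str, Any]]:
        pages: List[Dict[str, Any]] = []

        async def process_page(page_idx: int) -> None:
            pages.append(await self.extract_page_data(page_idx))

        # end_page is exclusive here, mirroring range(start_page, end_page) in the diff.
        tasks = [
            asyncio.ensure_future(process_page(page_idx))
            for page_idx in range(start_page, end_page)
        ]
        for task in asyncio.as_completed(tasks):
            await task
        # Completion order is not guaranteed, so sort before returning
        # (an assumption of this sketch; the diff returns pages unsorted).
        return sorted(pages, key=lambda page: page["page_idx"])


class FakeLoader(SketchPageLoader):
    """Hypothetical subclass that fabricates page data instead of reading a PDF."""

    async def extract_page_data(self, page_idx: int) -> Dict[str, Any]:
        await asyncio.sleep(0)  # yield control, as real I/O would
        return {"page_idx": page_idx, "image_file_paths": []}


pages = asyncio.run(FakeLoader().load_data(start_page=2, end_page=5))
```

Because pages are appended as tasks finish, callers that rely on page order may want a sort like the one shown.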

medrag_multi_modal/document_loader/image_loader/fitzpil_img_loader.py ADDED

	@@ -0,0 +1,127 @@

+import io
+import os
+from typing import Any, Dict
+import fitz
+from PIL import Image, ImageOps, UnidentifiedImageError
+from .base_img_loader import BaseImageLoader
+class FitzPILImageLoader(BaseImageLoader):
+    """
+    `FitzPILImageLoader` is a class that extends the `BaseImageLoader` class to handle the extraction and
+    loading of pages from a PDF file as images using the fitz and PIL libraries.
+    This class provides functionality to extract images from a PDF file using fitz and PIL libraries,
+    and optionally publish these images to a WandB artifact.
+    !!! example "Example Usage"
+        ```python
+        import asyncio
+        import weave
+        import wandb
+        from medrag_multi_modal.document_loader.image_loader import FitzPILImageLoader
+        weave.init(project_name="ml-colabs/medrag-multi-modal")
+        wandb.init(project="medrag-multi-modal", entity="ml-colabs")
+        url = "https://archive.org/download/GraysAnatomy41E2015PDF/Grays%20Anatomy-41%20E%20%282015%29%20%5BPDF%5D.pdf"
+        loader = FitzPILImageLoader(
+            url=url,
+            document_name="Gray's Anatomy",
+            document_file_path="grays_anatomy.pdf",
+        )
+        asyncio.run(
+            loader.load_data(
+                start_page=32,
+                end_page=37,
+                wandb_artifact_name="grays-anatomy-images-fitzpil",
+                cleanup=False,
+            )
+        )
+        ```
+    Args:
+        url (str): The URL of the PDF document.
+        document_name (str): The name of the document.
+        document_file_path (str): The path to the PDF file.
+    """
+    def __init__(self, url: str, document_name: str, document_file_path: str):
+        super().__init__(url, document_name, document_file_path)
+    async def extract_page_data(
+        self, page_idx: int, image_save_dir: str, **kwargs
+    ) -> Dict[str, Any]:
+        """
+        Extracts a single page from the PDF as an image using fitz and PIL libraries.
+        Args:
+            page_idx (int): The index of the page to process.
+            image_save_dir (str): The directory to save the extracted image.
+            **kwargs: Additional keyword arguments that may be used by fitz and PIL.
+        Returns:
+            Dict[str, Any]: A dictionary containing the processed page data.
+            The dictionary will have the following keys and values:
+            - "page_idx": (int) the index of the page.
+            - "document_name": (str) the name of the document.
+            - "file_path": (str) the local file path where the PDF is stored.
+            - "file_url": (str) the URL of the PDF file.
+            - "image_file_paths": (list) the local file paths where the images are stored.
+        """
+        image_file_paths = []
+        pdf_document = fitz.open(self.document_file_path)
+        page = pdf_document.load_page(page_idx)
+        images = page.get_images(full=True)
+        for img_idx, image in enumerate(images):
+            xref = image[0]
+            base_image = pdf_document.extract_image(xref)
+            image_bytes = base_image["image"]
+            image_ext = base_image["ext"]
+            try:
+                img = Image.open(io.BytesIO(image_bytes))
+                if img.mode in ["L"]:
+                    # greyscale images look inverted; this needs testing on other PDFs
+                    img = ImageOps.invert(img)
+                if img.mode == "CMYK":
+                    img = img.convert("RGB")
+                if image_ext not in ["png", "jpg", "jpeg"]:
+                    image_ext = "png"
+                    image_file_name = f"page{page_idx}_fig{img_idx}.png"
+                    image_file_path = os.path.join(image_save_dir, image_file_name)
+                    img.save(image_file_path, format="PNG")
+                else:
+                    image_file_name = f"page{page_idx}_fig{img_idx}.{image_ext}"
+                    image_file_path = os.path.join(image_save_dir, image_file_name)
+                    with open(image_file_path, "wb") as image_file:
+                        image_file.write(image_bytes)
+                image_file_paths.append(image_file_path)
+            except (UnidentifiedImageError, OSError) as e:
+                print(
+                    f"Skipping image at page {page_idx}, fig {img_idx} due to an error: {e}"
+                )
+                continue
+        pdf_document.close()
+        return {
+            "page_idx": page_idx,
+            "document_name": self.document_name,
+            "file_path": self.document_file_path,
+            "file_url": self.url,
+            "image_file_paths": image_file_paths,
+        }
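
The branch in `extract_page_data` above that decides between re-encoding with PIL and writing the raw bytes can be isolated as a pure function. This is a sketch for illustration only; `plan_image_output` is a hypothetical helper name that does not appear in the commit.

```python
from typing import Tuple

# Extensions the loader writes out unchanged, taken from the diff above.
PASSTHROUGH_EXTENSIONS = ("png", "jpg", "jpeg")


def plan_image_output(page_idx: int, img_idx: int, image_ext: str) -> Tuple[str, bool]:
    """Return (file_name, needs_reencode) for one embedded image.

    Hypothetical helper: it mirrors the naming scheme and the extension
    check in the diff, but is not part of the library.
    """
    if image_ext not in PASSTHROUGH_EXTENSIONS:
        # Unknown or exotic formats are re-encoded as PNG via PIL.
        return f"page{page_idx}_fig{img_idx}.png", True
    # Known formats are written from the original bytes.
    return f"page{page_idx}_fig{img_idx}.{image_ext}", False
```

Note that in the pass-through branch the loader writes `image_bytes` directly, so the greyscale inversion and the CMYK-to-RGB conversion applied to the PIL object would appear to affect only the re-encoded PNG path.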

medrag_multi_modal/document_loader/image_loader/marker_img_loader.py ADDED

	@@ -0,0 +1,100 @@

+import os
+from typing import Any, Dict
+from marker.convert import convert_single_pdf
+from marker.models import load_all_models
+from .base_img_loader import BaseImageLoader
+class MarkerImageLoader(BaseImageLoader):
+    """
+    `MarkerImageLoader` is a class that extends the `BaseImageLoader` class to handle the extraction and
+    loading of pages from a PDF file as images using the marker library.
+    This class provides functionality to extract images from a PDF file using marker library,
+    and optionally publish these images to a WandB artifact.
+    !!! example "Example Usage"
+        ```python
+        import asyncio
+        import weave
+        import wandb
+        from medrag_multi_modal.document_loader.image_loader import MarkerImageLoader
+        weave.init(project_name="ml-colabs/medrag-multi-modal")
+        wandb.init(project="medrag-multi-modal", entity="ml-colabs")
+        url = "https://archive.org/download/GraysAnatomy41E2015PDF/Grays%20Anatomy-41%20E%20%282015%29%20%5BPDF%5D.pdf"
+        loader = MarkerImageLoader(
+            url=url,
+            document_name="Gray's Anatomy",
+            document_file_path="grays_anatomy.pdf",
+        )
+        asyncio.run(
+            loader.load_data(
+                start_page=31,
+                end_page=36,
+                wandb_artifact_name="grays-anatomy-images-marker",
+                cleanup=False,
+            )
+        )
+        ```
+    Args:
+        url (str): The URL of the PDF document.
+        document_name (str): The name of the document.
+        document_file_path (str): The path to the PDF file.
+    """
+    def __init__(self, url: str, document_name: str, document_file_path: str):
+        super().__init__(url, document_name, document_file_path)
+        self.model_lst = load_all_models()
+    async def extract_page_data(
+        self, page_idx: int, image_save_dir: str, **kwargs
+    ) -> Dict[str, Any]:
+        """
+        Extracts a single page from the PDF as an image using marker library.
+        Args:
+            page_idx (int): The index of the page to process.
+            image_save_dir (str): The directory to save the extracted image.
+            **kwargs: Additional keyword arguments that may be used by marker.
+        Returns:
+            Dict[str, Any]: A dictionary containing the processed page data.
+            The dictionary will have the following keys and values:
+            - "page_idx": (int) the index of the page.
+            - "document_name": (str) the name of the document.
+            - "file_path": (str) the local file path where the PDF is stored.
+            - "file_url": (str) the URL of the PDF file.
+            - "image_file_paths": (list) the local file paths where the images are stored.
+            - "meta": the metadata returned by marker for the converted page.
+        """
+        _, images, out_meta = convert_single_pdf(
+            self.document_file_path,
+            self.model_lst,
+            max_pages=1,
+            batch_multiplier=1,
+            start_page=page_idx,
+            ocr_all_pages=True,
+            **kwargs,
+        )
+        image_file_paths = []
+        for img_idx, (_, image) in enumerate(images.items()):
+            image_file_name = f"page{page_idx}_fig{img_idx}.png"
+            image_file_path = os.path.join(image_save_dir, image_file_name)
+            image.save(image_file_path, "png")
+            image_file_paths.append(image_file_path)
+        return {
+            "page_idx": page_idx,
+            "document_name": self.document_name,
+            "file_path": self.document_file_path,
+            "file_url": self.url,
+            "image_file_paths": image_file_paths,
+            "meta": out_meta,
+        }

medrag_multi_modal/document_loader/image_loader/pdf2image_img_loader.py ADDED

	@@ -0,0 +1,92 @@

+import os
+from typing import Any, Dict
+from pdf2image.pdf2image import convert_from_path
+from .base_img_loader import BaseImageLoader
+class PDF2ImageLoader(BaseImageLoader):
+    """
+    `PDF2ImageLoader` is a class that extends the `BaseImageLoader` class to handle the extraction and
+    loading of pages from a PDF file as images using the pdf2image library.
+    This class provides functionality to convert specific pages of a PDF document into images
+    and optionally publish these images to a WandB artifact.
+    Each image is a rendered snapshot of a full page, not an embedded image extracted from it.
+    !!! example "Example Usage"
+        ```python
+        import asyncio
+        import weave
+        import wandb
+        from medrag_multi_modal.document_loader.image_loader import PDF2ImageLoader
+        weave.init(project_name="ml-colabs/medrag-multi-modal")
+        wandb.init(project="medrag-multi-modal", entity="ml-colabs")
+        url = "https://archive.org/download/GraysAnatomy41E2015PDF/Grays%20Anatomy-41%20E%20%282015%29%20%5BPDF%5D.pdf"
+        loader = PDF2ImageLoader(
+            url=url,
+            document_name="Gray's Anatomy",
+            document_file_path="grays_anatomy.pdf",
+        )
+        asyncio.run(
+            loader.load_data(
+                start_page=31,
+                end_page=36,
+                wandb_artifact_name="grays-anatomy-images-pdf2image",
+                cleanup=False,
+            )
+        )
+        ```
+    Args:
+        url (str): The URL of the PDF document.
+        document_name (str): The name of the document.
+        document_file_path (str): The path to the PDF file.
+    """
+    def __init__(self, url: str, document_name: str, document_file_path: str):
+        super().__init__(url, document_name, document_file_path)
+    async def extract_page_data(
+        self, page_idx: int, image_save_dir: str, **kwargs
+    ) -> Dict[str, Any]:
+        """
+        Extracts a single page from the PDF as an image using pdf2image library.
+        Args:
+            page_idx (int): The index of the page to process.
+            image_save_dir (str): The directory to save the extracted image.
+            **kwargs: Additional keyword arguments that may be used by pdf2image.
+        Returns:
+            Dict[str, Any]: A dictionary containing the processed page data.
+            The dictionary will have the following keys and values:
+            - "page_idx": (int) the index of the page.
+            - "document_name": (str) the name of the document.
+            - "file_path": (str) the local file path where the PDF is stored.
+            - "file_url": (str) the URL of the PDF file.
+            - "image_file_path": (str) the local file path where the image is stored.
+        """
+        image = convert_from_path(
+            self.document_file_path,
+            first_page=page_idx + 1,
+            last_page=page_idx + 1,
+            **kwargs,
+        )[0]
+        image_file_name = f"page{page_idx}.png"
+        image_file_path = os.path.join(image_save_dir, image_file_name)
+        image.save(image_file_path)
+        return {
+            "page_idx": page_idx,
+            "document_name": self.document_name,
+            "file_path": self.document_file_path,
+            "file_url": self.url,
+            "image_file_path": image_file_path,
+        }
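
The loader above uses 0-based page indices, while `convert_from_path` takes 1-based `first_page` and `last_page` arguments, hence the `page_idx + 1` in the call. A small sketch of that conversion follows; `single_page_range` is a hypothetical helper, not something the commit defines.

```python
from typing import Dict


def single_page_range(page_idx: int) -> Dict[str, int]:
    """Map a 0-based page index to pdf2image's 1-based page arguments.

    Hypothetical helper for illustration; the loader inlines this arithmetic.
    """
    if page_idx < 0:
        raise ValueError("page_idx must be non-negative")
    page_number = page_idx + 1  # pdf2image counts pages from 1
    return {"first_page": page_number, "last_page": page_number}
```

The resulting dictionary could be unpacked into the call, for example `convert_from_path(path, **single_page_range(31))`, to render only the page at index 31.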

medrag_multi_modal/document_loader/image_loader/pdfplumber_img_loader.py ADDED

	@@ -0,0 +1,101 @@

+import os
+from typing import Any, Dict
+import pdfplumber
+from .base_img_loader import BaseImageLoader
+class PDFPlumberImageLoader(BaseImageLoader):
+    """
+    `PDFPlumberImageLoader` is a class that extends the `BaseImageLoader` class to handle the extraction and
+    loading of pages from a PDF file as images using the pdfplumber library.
+    This class provides functionality to extract images from a PDF file using pdfplumber library,
+    and optionally publish these images to a WandB artifact.
+    !!! example "Example Usage"
+        ```python
+        import asyncio
+        import weave
+        import wandb
+        from medrag_multi_modal.document_loader.image_loader import PDFPlumberImageLoader
+        weave.init(project_name="ml-colabs/medrag-multi-modal")
+        wandb.init(project="medrag-multi-modal", entity="ml-colabs")
+        url = "https://archive.org/download/GraysAnatomy41E2015PDF/Grays%20Anatomy-41%20E%20%282015%29%20%5BPDF%5D.pdf"
+        loader = PDFPlumberImageLoader(
+            url=url,
+            document_name="Gray's Anatomy",
+            document_file_path="grays_anatomy.pdf",
+        )
+        asyncio.run(
+            loader.load_data(
+                start_page=32,
+                end_page=37,
+                wandb_artifact_name="grays-anatomy-images-pdfplumber",
+                cleanup=False,
+            )
+        )
+        ```
+    Args:
+        url (str): The URL of the PDF document.
+        document_name (str): The name of the document.
+        document_file_path (str): The path to the PDF file.
+    """
+    def __init__(self, url: str, document_name: str, document_file_path: str):
+        super().__init__(url, document_name, document_file_path)
+    async def extract_page_data(
+        self, page_idx: int, image_save_dir: str, **kwargs
+    ) -> Dict[str, Any]:
+        """
+        Extracts a single page from the PDF as an image using pdfplumber library.
+        Args:
+            page_idx (int): The index of the page to process.
+            image_save_dir (str): The directory to save the extracted image.
+            **kwargs: Additional keyword arguments that may be used by pdfplumber.
+        Returns:
+            Dict[str, Any]: A dictionary containing the processed page data.
+            The dictionary will have the following keys and values:
+            - "page_idx": (int) the index of the page.
+            - "document_name": (str) the name of the document.
+            - "file_path": (str) the local file path where the PDF is stored.
+            - "file_url": (str) the URL of the PDF file.
+            - "image_file_paths": (list) the local file paths where the images are stored.
+        """
+        with pdfplumber.open(self.document_file_path) as pdf:
+            page = pdf.pages[page_idx]
+            images = page.images
+            image_file_paths = []
+            for img_idx, image in enumerate(images):
+                extracted_image = page.crop(
+                    (
+                        image["x0"],
+                        image["top"],
+                        image["x1"],
+                        image["bottom"],
+                    )
+                ).to_image(resolution=300)
+                image_file_name = f"page{page_idx}_fig{img_idx}.png"
+                image_file_path = os.path.join(image_save_dir, image_file_name)
+                extracted_image.save(image_file_path, "png")
+                image_file_paths.append(image_file_path)
+        return {
+            "page_idx": page_idx,
+            "document_name": self.document_name,
+            "file_path": self.document_file_path,
+            "file_url": self.url,
+            "image_file_paths": image_file_paths,
+        }
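
The crop box passed to `page.crop` above is built from four keys of each pdfplumber image dictionary. The sketch below shows that construction, plus an optional clamp to the page bounds. The clamp is not in the commit; it is an assumption that it may help when an image's box extends slightly past the page, which pdfplumber's `crop` can reject in its default strict mode. Both helper names are hypothetical.

```python
from typing import Mapping, Tuple

BBox = Tuple[float, float, float, float]


def image_bbox(image: Mapping[str, float]) -> BBox:
    """Build the (x0, top, x1, bottom) box the loader passes to page.crop."""
    return (image["x0"], image["top"], image["x1"], image["bottom"])


def clamp_bbox(bbox: BBox, page_bbox: BBox) -> BBox:
    """Clamp a box to the page bounds (an addition, not in the diff)."""
    x0, top, x1, bottom = bbox
    page_x0, page_top, page_x1, page_bottom = page_bbox
    return (
        max(x0, page_x0),
        max(top, page_top),
        min(x1, page_x1),
        min(bottom, page_bottom),
    )


# Made-up coordinates for a US Letter page (612 x 792 points).
example_image = {"x0": -5.0, "top": 10.0, "x1": 620.0, "bottom": 300.0}
clamped = clamp_bbox(image_bbox(example_image), (0.0, 0.0, 612.0, 792.0))
```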

medrag_multi_modal/document_loader/image_loader/pymupdf_img_loader.py ADDED

	@@ -0,0 +1,124 @@

+import io
+import os
+from typing import Any, Dict
+import fitz
+from PIL import Image
+from .base_img_loader import BaseImageLoader
+class PyMuPDFImageLoader(BaseImageLoader):
+    """
+    `PyMuPDFImageLoader` is a class that extends the `BaseImageLoader` class to handle the extraction and
+    loading of pages from a PDF file as images using the pymupdf library.
+    This class provides functionality to extract images from a PDF file using pymupdf library,
+    and optionally publish these images to a WandB artifact.
+    !!! example "Example Usage"
+        ```python
+        import asyncio
+        import weave
+        import wandb
+        from medrag_multi_modal.document_loader.image_loader import PyMuPDFImageLoader
+        weave.init(project_name="ml-colabs/medrag-multi-modal")
+        wandb.init(project="medrag-multi-modal", entity="ml-colabs")
+        url = "https://archive.org/download/GraysAnatomy41E2015PDF/Grays%20Anatomy-41%20E%20%282015%29%20%5BPDF%5D.pdf"
+        loader = PyMuPDFImageLoader(
+            url=url,
+            document_name="Gray's Anatomy",
+            document_file_path="grays_anatomy.pdf",
+        )
+        asyncio.run(
+            loader.load_data(
+                start_page=32,
+                end_page=37,
+                wandb_artifact_name="grays-anatomy-images-pymupdf",
+                cleanup=False,
+            )
+        )
+        ```
+    Args:
+        url (str): The URL of the PDF document.
+        document_name (str): The name of the document.
+        document_file_path (str): The path to the PDF file.
+    """
+    def __init__(self, url: str, document_name: str, document_file_path: str):
+        super().__init__(url, document_name, document_file_path)
+    async def extract_page_data(
+        self, page_idx: int, image_save_dir: str, **kwargs
+    ) -> Dict[str, Any]:
+        """
+        Extracts a single page from the PDF as an image using pymupdf library.
+        Args:
+            page_idx (int): The index of the page to process.
+            image_save_dir (str): The directory to save the extracted image.
+            **kwargs: Additional keyword arguments that may be used by pymupdf.
+        Returns:
+            Dict[str, Any]: A dictionary containing the processed page data.
+            The dictionary will have the following keys and values:
+            - "page_idx": (int) the index of the page.
+            - "document_name": (str) the name of the document.
+            - "file_path": (str) the local file path where the PDF is stored.
+            - "file_url": (str) the URL of the PDF file.
+            - "image_file_paths": (list) the local file paths where the images are stored.
+        """
+        image_file_paths = []
+        pdf_document = fitz.open(self.document_file_path)
+        page = pdf_document[page_idx]
+        images = page.get_images(full=True)
+        for img_idx, image in enumerate(images):
+            xref = image[0]
+            base_image = pdf_document.extract_image(xref)
+            image_bytes = base_image["image"]
+            image_ext = base_image["ext"]
+            if image_ext == "jb2":
+                image_ext = "png"
+            elif image_ext == "jpx":
+                image_ext = "jpg"
+            image_file_name = f"page{page_idx}_fig{img_idx}.{image_ext}"
+            image_file_path = os.path.join(image_save_dir, image_file_name)
+            # For JBIG2 and JPEG2000, we need to convert the image
+            if base_image["ext"] in ["jb2", "jpx"]:
+                try:
+                    pix = fitz.Pixmap(image_bytes)
+                    pix.save(image_file_path)
+                except Exception as err_fitz:
+                    print(f"Error processing image with fitz: {err_fitz}")
+                    # Fallback to using PIL for image conversion
+                    try:
+                        img = Image.open(io.BytesIO(image_bytes))
+                        img.save(image_file_path)
+                    except Exception as err_pil:
+                        print(f"Failed to process image with PIL: {err_pil}")
+                        continue  # Skip this image if both methods fail
+            else:
+                with open(image_file_path, "wb") as image_file:
+                    image_file.write(image_bytes)
+            image_file_paths.append(image_file_path)
+        pdf_document.close()
+        return {
+            "page_idx": page_idx,
+            "document_name": self.document_name,
+            "file_path": self.document_file_path,
+            "file_url": self.url,
+            "image_file_paths": image_file_paths,
+        }
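
A minimal sketch of the file-naming rule used in `extract_page_data` above, pulled out into a helper so it can be checked in isolation. The names `EXT_REMAP` and `build_image_path` are hypothetical and are not part of the commit; only the `jb2` to `png` and `jpx` to `jpg` remapping and the `page{page_idx}_fig{img_idx}` pattern come from the diff.

```python
import os
from typing import Tuple

# Assumption: this table mirrors the if/elif remapping in the diff above.
EXT_REMAP = {"jb2": "png", "jpx": "jpg"}


def build_image_path(
    image_save_dir: str, page_idx: int, img_idx: int, ext: str
) -> Tuple[str, bool]:
    """Return (output path, needs_conversion) for one embedded image."""
    # JBIG2 and JPEG2000 streams are the ones the loader re-encodes.
    needs_conversion = ext in EXT_REMAP
    out_ext = EXT_REMAP.get(ext, ext)
    file_name = f"page{page_idx}_fig{img_idx}.{out_ext}"
    return os.path.join(image_save_dir, file_name), needs_conversion
```

Returning the `needs_conversion` flag alongside the path keeps the original extension check and the remapped extension in one place, instead of testing `base_image["ext"]` a second time.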

medrag_multi_modal/document_loader/load_image.py DELETED

@@ -1,131 +0,0 @@
-import asyncio
-import os
-from typing import Optional
-import rich
-import wandb
-import weave
-from pdf2image.pdf2image import convert_from_path
-from PIL import Image
-from medrag_multi_modal.document_loader.text_loader import PyMuPDF4LLMTextLoader
-class ImageLoader(PyMuPDF4LLMTextLoader):
-    """
-    `ImageLoader` is a class that extends the `TextLoader` class to handle the extraction and
-    loading of pages from a PDF file as images.
-    This class provides functionality to convert specific pages of a PDF document into images
-    and optionally publish these images to a Weave dataset.
-    !!! example "Example Usage"
-        ```python
-        import asyncio
-        import wandb
-        from dotenv import load_dotenv
-        from medrag_multi_modal.document_loader import ImageLoader
-        load_dotenv()
-        wandb.init(project="medrag-multi-modal", entity="ml-colabs")
-        url = "https://archive.org/download/GraysAnatomy41E2015PDF/Grays%20Anatomy-41%20E%20%282015%29%20%5BPDF%5D.pdf"
-        loader = ImageLoader(
-            url=url,
-            document_name="Gray's Anatomy",
-            document_file_path="grays_anatomy.pdf",
-        )
-        asyncio.run(
-            loader.load_data(
-                start_page=31,
-                end_page=33,
-                dataset_name="grays-anatomy-images",
-            )
-        )
-        ```
-    Args:
-        url (str): The URL of the PDF document.
-        document_name (str): The name of the document.
-        document_file_path (str): The path to the PDF file.
-    """
-    def __init__(self, url: str, document_name: str, document_file_path: str):
-        super().__init__(url, document_name, document_file_path)
-    def extract_data_from_pdf_file(
-        self, pdf_file: str, page_number: int
-    ) -> Image.Image:
-        image = convert_from_path(
-            pdf_file, first_page=page_number + 1, last_page=page_number + 1
-        )[0]
-        return image
-    async def load_data(
-        self,
-        start_page: Optional[int] = None,
-        end_page: Optional[int] = None,
-        image_save_dir: str = "./images",
-        dataset_name: Optional[str] = None,
-    ):
-        """
-        Asynchronously loads images from a PDF file specified by a URL or local file path,
-        processes the images for the specified range of pages, and optionally publishes them
-        to a Weave dataset.
-        This function reads the specified range of pages from a PDF document, converts each page
-        to an image using the `pdf2image` library, and returns a list of dictionaries containing
-        the image and metadata for each processed page. It processes pages concurrently using
-        `asyncio` for efficiency. If a `dataset_name` is provided, the processed page images are
-        published to Weights & Biases artifact and the corresponding metadata to a Weave dataset
-        with the specified name.
-        Args:
-            start_page (Optional[int]): The starting page index (0-based) to process.
-            end_page (Optional[int]): The ending page index (0-based) to process.
-            dataset_name (Optional[str]): The name of the Weave dataset to publish the
-                processed images to. Defaults to None.
-        Returns:
-            list[dict]: A list of dictionaries, each containing the image and metadata for a
-                processed page.
-        Raises:
-            ValueError: If the specified start_page or end_page is out of bounds of the document's
-                page count.
-        """
-        os.makedirs(image_save_dir, exist_ok=True)
-        start_page, end_page = self.get_page_indices(start_page, end_page)
-        pages = []
-        processed_pages_counter: int = 1
-        total_pages = end_page - start_page
-        async def process_page(page_idx):
-            nonlocal processed_pages_counter
-            image = convert_from_path(
-                self.document_file_path,
-                first_page=page_idx + 1,
-                last_page=page_idx + 1,
-            )[0]
-            pages.append(
-                {
-                    "page_idx": page_idx,
-                    "document_name": self.document_name,
-                    "file_path": self.document_file_path,
-                    "file_url": self.url,
-                }
-            )
-            image.save(os.path.join(image_save_dir, f"{page_idx}.png"))
-            rich.print(f"Processed pages {processed_pages_counter}/{total_pages}")
-            processed_pages_counter += 1
-        tasks = [process_page(page_idx) for page_idx in range(start_page, end_page)]
-        for task in asyncio.as_completed(tasks):
-            await task
-        if dataset_name:
-            artifact = wandb.Artifact(name=dataset_name, type="dataset")
-            artifact.add_dir(local_path=image_save_dir)
-            artifact.save()
-            weave.publish(weave.Dataset(name=dataset_name, rows=pages))
-        return pages
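
A note on the concurrency in the removed `load_data`: `process_page` is declared `async` but never awaits anything, so each blocking conversion runs to completion before the next task starts. The sketch below shows one way such blocking work could be moved to worker threads with `asyncio.to_thread` (Python 3.9+). `render_page` and `load_pages` are hypothetical stand-ins, not code from this repository, and the sleep only simulates a blocking call such as `convert_from_path`.

```python
import asyncio
import time
from typing import Dict, List


def render_page(page_idx: int) -> Dict[str, int]:
    # Hypothetical stand-in for a blocking page conversion.
    time.sleep(0.01)
    return {"page_idx": page_idx}


async def load_pages(start_page: int, end_page: int) -> List[Dict[str, int]]:
    # Each blocking call runs in a worker thread, so the calls can overlap.
    tasks = [
        asyncio.to_thread(render_page, page_idx)
        for page_idx in range(start_page, end_page)  # end_page is exclusive
    ]
    # gather returns results in the order the tasks were passed in.
    return await asyncio.gather(*tasks)


pages = asyncio.run(load_pages(31, 33))
```

Because `asyncio.gather` preserves input order, the returned rows stay sorted by `page_idx` without a shared counter or a `nonlocal` variable.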

medrag_multi_modal/document_loader/text_loader/marker_text_loader.py CHANGED

@@ -53,15 +53,16 @@ class MarkerTextLoader(BaseTextLoader):
         """
         Process a single page of the PDF and extract its structured text using marker-pdf.
-        Returns a dictionary with the processed page data.
-        The dictionary will have the following keys and values:
-        - "text": (str) the extracted structured text from the page.
-        - "page_idx": (int) the index of the page.
-        - "document_name": (str) the name of the document.
-        - "file_path": (str) the local file path where the PDF is stored.
-        - "file_url": (str) the URL of the PDF file.
-        - "meta": (dict) the metadata extracted from the page by marker-pdf.
         Args:
             page_idx (int): The index of the page to process.

         """
         Process a single page of the PDF and extract its structured text using marker-pdf.
+        Returns:
+            Dict[str, Any]: A dictionary with the processed page data.
+            The dictionary will have the following keys and values:
+            - "text": (str) the extracted structured text from the page.
+            - "page_idx": (int) the index of the page.
+            - "document_name": (str) the name of the document.
+            - "file_path": (str) the local file path where the PDF is stored.
+            - "file_url": (str) the URL of the PDF file.
+            - "meta": (dict) the metadata extracted from the page by marker-pdf.
         Args:
             page_idx (int): The index of the page to process.

medrag_multi_modal/document_loader/text_loader/pdfplumber_text_loader.py CHANGED

@@ -52,14 +52,15 @@ class PDFPlumberTextLoader(BaseTextLoader):
         """
         Process a single page of the PDF and extract its text using pdfplumber.
-        Returns a dictionary with the processed page data.
-        The dictionary will have the following keys and values:
-        - "text": (str) the extracted text from the page.
-        - "page_idx": (int) the index of the page.
-        - "document_name": (str) the name of the document.
-        - "file_path": (str) the local file path where the PDF is stored.
-        - "file_url": (str) the URL of the PDF file.
         Args:
             page_idx (int): The index of the page to process.

         """
         Process a single page of the PDF and extract its text using pdfplumber.
+        Returns:
+            Dict[str, Any]: A dictionary with the processed page data.
+            The dictionary will have the following keys and values:
+            - "text": (str) the extracted text from the page.
+            - "page_idx": (int) the index of the page.
+            - "document_name": (str) the name of the document.
+            - "file_path": (str) the local file path where the PDF is stored.
+            - "file_url": (str) the URL of the PDF file.
         Args:
             page_idx (int): The index of the page to process.

medrag_multi_modal/document_loader/text_loader/pymupdf4llm_text_loader.py CHANGED

@@ -52,14 +52,15 @@ class PyMuPDF4LLMTextLoader(BaseTextLoader):
         """
         Process a single page of the PDF and convert it to markdown using `pymupdf4llm`.
-        Returns a dictionary with the processed page data.
-        The dictionary will have the following keys and values:
-        - "text": (str) the processed page data in markdown format.
-        - "page_idx": (int) the index of the page.
-        - "document_name": (str) the name of the document.
-        - "file_path": (str) the local file path where the PDF is stored.
-        - "file_url": (str) the URL of the PDF file.
         Args:
             page_idx (int): The index of the page to process.

         """
         Process a single page of the PDF and convert it to markdown using `pymupdf4llm`.
+        Returns:
+            Dict[str, Any]: A dictionary with the processed page data.
+            The dictionary will have the following keys and values:
+            - "text": (str) the processed page data in markdown format.
+            - "page_idx": (int) the index of the page.
+            - "document_name": (str) the name of the document.
+            - "file_path": (str) the local file path where the PDF is stored.
+            - "file_url": (str) the URL of the PDF file.
         Args:
             page_idx (int): The index of the page to process.

medrag_multi_modal/document_loader/text_loader/pypdf2_text_loader.py CHANGED

@@ -52,14 +52,15 @@ class PyPDF2TextLoader(BaseTextLoader):
         """
         Process a single page of the PDF and extract its text using PyPDF2.
-        Returns a dictionary with the processed page data.
-        The dictionary will have the following keys and values:
-        - "text": (str) the extracted text from the page.
-        - "page_idx": (int) the index of the page.
-        - "document_name": (str) the name of the document.
-        - "file_path": (str) the local file path where the PDF is stored.
-        - "file_url": (str) the URL of the PDF file.
         Args:
             page_idx (int): The index of the page to process.

         """
         Process a single page of the PDF and extract its text using PyPDF2.
+        Returns:
+            Dict[str, Any]: A dictionary with the processed page data.
+            The dictionary will have the following keys and values:
+            - "text": (str) the extracted text from the page.
+            - "page_idx": (int) the index of the page.
+            - "document_name": (str) the name of the document.
+            - "file_path": (str) the local file path where the PDF is stored.
+            - "file_url": (str) the URL of the PDF file.
         Args:
             page_idx (int): The index of the page to process.

medrag_multi_modal/retrieval/multi_modal_retrieval.py CHANGED

@@ -1,11 +1,12 @@
 import os
 from typing import Any, Optional
-import wandb
 import weave
 from byaldi import RAGMultiModalModel
 from PIL import Image
 from ..utils import get_wandb_artifact

 import os
 from typing import Any, Optional
 import weave
 from byaldi import RAGMultiModalModel
 from PIL import Image
+import wandb
 from ..utils import get_wandb_artifact

mkdocs.yml CHANGED

@@ -69,7 +69,13 @@ nav:
       - PyPDF2: 'document_loader/text_loader/pypdf2_text_loader.md'
       - PDFPlumber: 'document_loader/text_loader/pdfplumber_text_loader.md'
       - Marker: 'document_loader/text_loader/marker_text_loader.md'
-    - Image Loader: 'document_loader/load_image.md'
   - Chunking: 'chunking.md'
   - Retrieval:
     - Multi-Modal Retrieval: 'retreival/multi_modal_retrieval.md'

       - PyPDF2: 'document_loader/text_loader/pypdf2_text_loader.md'
       - PDFPlumber: 'document_loader/text_loader/pdfplumber_text_loader.md'
       - Marker: 'document_loader/text_loader/marker_text_loader.md'
+    - Image Loader:
+      - Base: 'document_loader/image_loader/base_img_loader.md'
+      - PDF2Image: 'document_loader/image_loader/pdf2image_img_loader.md'
+      - Marker: 'document_loader/image_loader/marker_img_loader.md'
+      - PDFPlumber: 'document_loader/image_loader/pdfplumber_img_loader.md'
+      - PyMuPDF: 'document_loader/image_loader/pymupdf_img_loader.md'
+      - FitzPIL: 'document_loader/image_loader/fitzpil_img_loader.md'
   - Chunking: 'chunking.md'
   - Retrieval:
     - Multi-Modal Retrieval: 'retreival/multi_modal_retrieval.md'