amaye15/colqwen2-1.0-alpha-inference


amaye15 committed on Nov 7, 2024

Commit e165930 (1 parent: 64262c3)

handler clean up & readme updated

Files changed (2)

README.md +114 -0
handler.py +0 -136

README.md CHANGED

@@ -1,3 +1,117 @@
 ---
 license: mit
 ---

+# EndpointHandler
+`EndpointHandler` is a Python class that processes image and text data to generate embeddings and similarity scores using the ColQwen2 model, a visual retriever based on Qwen2-VL-2B-Instruct that uses the ColBERT strategy. The handler is intended for retrieving documents based on their visual and textual features.
+## Overview
+- **Efficient Document Retrieval**: Uses the ColQwen2 model to produce embeddings for images and text for accurate document retrieval.
+- **Multi-vector Representation**: Generates ColBERT-style multi-vector embeddings for improved similarity search.
+- **Flexible Image Resolution**: Supports dynamic image resolution without altering the aspect ratio, capped at 768 patches for memory efficiency.
+- **Device Compatibility**: Automatically utilizes available CUDA devices or defaults to CPU.
+## Model Details
+The **ColQwen2** model extends Qwen2-VL-2B with a focus on vision-language tasks, making it suitable for content indexing and retrieval. Key features include:
+- **Training**: Pre-trained with a batch size of 256 over 5 epochs, with a modified pad token.
+- **Input Flexibility**: Handles various image resolutions without resizing, ensuring accurate multi-vector representation.
+- **Similarity Scoring**: Utilizes a ColBERT-style scoring approach for efficient retrieval across image and text modalities.
+This base version is untrained, providing deterministic initialization of the projection layer for further customization.
+## How to Use
+The following example demonstrates how to use `EndpointHandler` for processing PDF documents and text. PDF pages are converted to base64 images, which are then passed as input alongside text data to the handler.
+### Example Script
+```python
+import base64
+from io import BytesIO
+import requests
+from pdf2image import convert_from_path
+# Function to convert PIL Image to base64 string
+def pil_image_to_base64(image):
+    """Converts a PIL Image to a base64 encoded string."""
+    buffer = BytesIO()
+    image.save(buffer, format="PNG")
+    return base64.b64encode(buffer.getvalue()).decode()
+# Function to convert PDF pages to base64 images
+def convert_pdf_to_base64_images(pdf_path):
+    """Converts PDF pages to base64 encoded images."""
+    pages = convert_from_path(pdf_path)
+    return [pil_image_to_base64(page) for page in pages]
+# Function to send payload to API and retrieve response
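+def query_api(payload, api_url, headers):
+    """Sends a POST request to the API and returns the parsed JSON response."""
+    response = requests.post(api_url, headers=headers, json=payload)
+    response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error body
+    return response.json()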
+# Main execution
+if __name__ == "__main__":
+    # Convert PDF pages to base64 encoded images
+    encoded_images = convert_pdf_to_base64_images('document.pdf')
+    # Prepare payload
+    payload = {
+        "inputs": [],
+        "image": encoded_images,
+        "text": ["example query text"]
+    }
+    # API configuration
+    API_URL = "https://your-api-url"
+    headers = {
+        "Accept": "application/json",
+        "Authorization": "Bearer your_access_token",
+        "Content-Type": "application/json"
+    }
+    # Query the API and get output
+    output = query_api(payload=payload, api_url=API_URL, headers=headers)
+    print(output)
+```
+## Inputs and Outputs
+### Input Format
+The `EndpointHandler` expects a dictionary containing:
+- **image**: A list of base64-encoded strings for images (e.g., PDF pages converted to images).
+- **text**: A list of text strings representing queries or document contents.
+- **batch_size** (optional): The batch size for processing images and text. Defaults to `4`.
+Example payload:
+```json
+{
+    "image": ["base64_image_string_1", "base64_image_string_2"],
+    "text": ["sample text 1", "sample text 2"],
+    "batch_size": 4
+}
+```
+### Output Format
+The handler returns a dictionary with the following keys:
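+- **image**: List of embeddings, one per image. Each embedding is a ColBERT-style multi-vector representation, i.e. a list of vectors.
+- **text**: List of embeddings, one per text entry, in the same multi-vector form.
+- **scores**: Nested list of similarity scores with one row per text entry and one column per image. Empty when either `image` or `text` is missing from the request.
+Example output:
+```json
+{
+    "image": [[[0.12, 0.34, ...], ...], [[0.56, 0.78, ...], ...]],
+    "text": [[[0.11, 0.22, ...], ...], [[0.33, 0.44, ...], ...]],
+    "scores": [[0.87, 0.45], [0.23, 0.67]]
+}
+```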
+```
+### Error Handling
+If any issues occur during processing (e.g., decoding images or model inference), the handler logs the error and returns an error message in the output dictionary.
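
Reviewer note (not part of the committed README): a minimal client-side sketch of how the response described above might be consumed. It assumes that `scores` holds one row per text entry and one column per image, and that failures surface under an `error` key, as the `handler.py` code removed in this commit suggests. `rank_pages` and `top_k` are hypothetical names introduced here for illustration.

```python
from typing import Any, Dict, List, Tuple


def rank_pages(response: Dict[str, Any], top_k: int = 3) -> List[List[Tuple[int, float]]]:
    """For each text entry, return the top_k (image_index, score) pairs.

    Assumes response["scores"][i][j] is the score of text i against image j.
    """
    # Assumption: the handler reports failures under an "error" key.
    if "error" in response:
        raise RuntimeError(f"Handler reported an error: {response['error']}")

    rankings = []
    for row in response.get("scores", []):
        # Pair each score with its image index, then sort best-first.
        ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
        rankings.append(ranked[:top_k])
    return rankings


# Scores copied from the README's example output (2 text entries x 2 images).
example_response = {"image": [], "text": [], "scores": [[0.87, 0.45], [0.23, 0.67]]}
rankings = rank_pages(example_response, top_k=1)
print(rankings)  # [[(0, 0.87)], [(1, 0.67)]]
```

With PDF input, the image index corresponds to the page's position in the list sent in the request, so the first query above would point at the first page and the second query at the second page.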

handler.py CHANGED

@@ -1,139 +1,3 @@
-# import torch
-# from typing import Dict, Any, List
-# from PIL import Image
-# import base64
-# from io import BytesIO
-# class EndpointHandler:
-#     """
-#     A handler class for processing image and text data, generating embeddings using a specified model and processor.
-#     Attributes:
-#         model: The pre-trained model used for generating embeddings.
-#         processor: The pre-trained processor used to process images and text before model inference.
-#         device: The device (CPU or CUDA) used to run model inference.
-#         default_batch_size: The default batch size for processing images and text in batches.
-#     """
-#     def __init__(self, path: str = "", default_batch_size: int = 4):
-#         """
-#         Initializes the EndpointHandler with a specified model path and default batch size.
-#         Args:
-#             path (str): Path to the pre-trained model and processor.
-#             default_batch_size (int): Default batch size for processing images and text data.
-#         """
-#         from colpali_engine.models import ColQwen2, ColQwen2Processor
-#         self.model = ColQwen2.from_pretrained(
-#             path,
-#             torch_dtype=torch.bfloat16,
-#             device_map=(
-#                 "cuda:0" if torch.cuda.is_available() else "cpu"
-#             ),  # Set device map based on availability
-#         ).eval()
-#         self.processor = ColQwen2Processor.from_pretrained(path)
-#         self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-#         self.model.to(self.device)
-#         self.default_batch_size = default_batch_size
-#     def _process_image_batch(self, images: List[Image.Image]) -> List[List[float]]:
-#         """
-#         Processes a batch of images and generates embeddings.
-#         Args:
-#             images (List[Image.Image]): List of images to process.
-#         Returns:
-#             List[List[float]]: List of embeddings for each image.
-#         """
-#         batch_images = self.processor.process_images(images).to(self.device)
-#         with torch.no_grad():
-#             image_embeddings = self.model(**batch_images)
-#         return image_embeddings.cpu().tolist()
-#     def _process_text_batch(self, texts: List[str]) -> List[List[float]]:
-#         """
-#         Processes a batch of text queries and generates embeddings.
-#         Args:
-#             texts (List[str]): List of text queries to process.
-#         Returns:
-#             List[List[float]]: List of embeddings for each text query.
-#         """
-#         batch_queries = self.processor.process_queries(texts).to(self.device)
-#         with torch.no_grad():
-#             query_embeddings = self.model(**batch_queries)
-#         return query_embeddings.cpu().tolist()
-#     def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
-#         """
-#         Processes input data containing base64-encoded images and text queries, decodes them, and generates embeddings.
-#         Args:
-#             data (Dict[str, Any]): Dictionary containing input images, text queries, and optional batch size.
-#         Returns:
-#             Dict[str, Any]: Dictionary containing generated embeddings for images and text or error messages.
-#         """
-#         images_data = data.get("image", [])
-#         text_data = data.get("text", [])
-#         batch_size = data.get("batch_size", self.default_batch_size)
-#         # Decode and process images
-#         images = []
-#         if images_data:
-#             for img_data in images_data:
-#                 if isinstance(img_data, str):
-#                     try:
-#                         image_bytes = base64.b64decode(img_data)
-#                         image = Image.open(BytesIO(image_bytes)).convert("RGB")
-#                         images.append(image)
-#                     except Exception as e:
-#                         return {"error": f"Invalid image data: {e}"}
-#                 else:
-#                     return {"error": "Images should be base64-encoded strings."}
-#         image_embeddings = []
-#         for i in range(0, len(images), batch_size):
-#             batch_images = images[i : i + batch_size]
-#             batch_embeddings = self._process_image_batch(batch_images)
-#             image_embeddings.extend(batch_embeddings)
-#         # Process text data
-#         text_embeddings = []
-#         if text_data:
-#             for i in range(0, len(text_data), batch_size):
-#                 batch_texts = text_data[i : i + batch_size]
-#                 batch_text_embeddings = self._process_text_batch(batch_texts)
-#                 text_embeddings.extend(batch_text_embeddings)
-#         # Compute similarity scores if both image and text embeddings are available
-#         scores = []
-#         if image_embeddings and text_embeddings:
-#             # Convert embeddings to tensors for scoring
-#             image_embeddings_tensor = torch.tensor(image_embeddings).to(self.device)
-#             text_embeddings_tensor = torch.tensor(text_embeddings).to(self.device)
-#             with torch.no_grad():
-#                 scores = (
-#                     self.processor.score_multi_vector(
-#                         text_embeddings_tensor, image_embeddings_tensor
-#                     )
-#                     .cpu()
-#                     .tolist()
-#                 )
-#         return {"image": image_embeddings, "text": text_embeddings, "scores": scores}
 import torch
 from typing import Dict, Any, List
 from PIL import Image
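
Reviewer note (not part of the commit): the removed handler delegates scoring to `self.processor.score_multi_vector`. The sketch below illustrates ColBERT-style late-interaction (MaxSim) scoring in plain Python, on the assumption that this is what the README's "ColBERT-style scoring approach" refers to. It is not the library's implementation, and `dot`, `maxsim_score`, and `score_matrix` are hypothetical helpers written for this note.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Late-interaction score: for each query vector, take its best match
    among the page vectors, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def score_matrix(queries: List[List[Vector]], pages: List[List[Vector]]) -> List[List[float]]:
    """One row per query, one column per page."""
    return [[maxsim_score(q, p) for p in pages] for q in queries]


# Toy 2-dimensional embeddings; real embeddings have far more dimensions.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.0, 0.5]]

scores = score_matrix([query], [page_a, page_b])
print(scores)  # [[2.0, 0.5]]
```

Because each query vector is matched independently, a page scores highly only when it contains a good match for every part of the query, which is the motivation for keeping multiple vectors per item instead of pooling them into one.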
