--- license: mit --- # EndpointHandler `EndpointHandler` is a Python class that processes image and text data to generate embeddings and similarity scores using the ColQwen2 model—a visual retriever based on Qwen2-VL-2B-Instruct with the ColBERT strategy. This handler is optimized for retrieving documents and visual information based on their visual and textual features. ## Overview - **Efficient Document Retrieval**: Uses the ColQwen2 model to produce embeddings for images and text for accurate document retrieval. - **Multi-vector Representation**: Generates ColBERT-style multi-vector embeddings for improved similarity search. - **Flexible Image Resolution**: Supports dynamic image resolution without altering the aspect ratio, capped at 768 patches for memory efficiency. - **Device Compatibility**: Automatically utilizes available CUDA devices or defaults to CPU. ## Model Details The **ColQwen2** model extends Qwen2-VL-2B with a focus on vision-language tasks, making it suitable for content indexing and retrieval. Key features include: - **Training**: Pre-trained with a batch size of 256 over 5 epochs, with a modified pad token. - **Input Flexibility**: Handles various image resolutions without resizing, ensuring accurate multi-vector representation. - **Similarity Scoring**: Utilizes a ColBERT-style scoring approach for efficient retrieval across image and text modalities. This base version is untrained, providing deterministic initialization of the projection layer for further customization. ## How to Use The following example demonstrates how to use `EndpointHandler` for processing PDF documents and text. PDF pages are converted to base64 images, which are then passed as input alongside text data to the handler. ### Example Script ```python import torch from pdf2image import convert_from_path import base64 from io import BytesIO import requests # Function to convert PIL Image to base64 string def pil_image_to_base64(image): """Converts a PIL Image to a base64 encoded string.""" buffer = BytesIO() image.save(buffer, format="PNG") return base64.b64encode(buffer.getvalue()).decode() # Function to convert PDF pages to base64 images def convert_pdf_to_base64_images(pdf_path): """Converts PDF pages to base64 encoded images.""" pages = convert_from_path(pdf_path) return [pil_image_to_base64(page) for page in pages] # Function to send payload to API and retrieve response def query_api(payload, api_url, headers): """Sends a POST request to the API and returns the response.""" response = requests.post(api_url, headers=headers, json=payload) return response.json() # Main execution if __name__ == "__main__": # Convert PDF pages to base64 encoded images encoded_images = convert_pdf_to_base64_images('document.pdf') # Prepare payload payload = { "inputs": [], "image": encoded_images, "text": ["example query text"] } # API configuration API_URL = "https://your-api-url" headers = { "Accept": "application/json", "Authorization": "Bearer your_access_token", "Content-Type": "application/json" } # Query the API and get output output = query_api(payload=payload, api_url=API_URL, headers=headers) print(output) ``` ## Inputs and Outputs ### Input Format The `EndpointHandler` expects a dictionary containing: - **image**: A list of base64-encoded strings for images (e.g., PDF pages converted to images). - **text**: A list of text strings representing queries or document contents. - **batch_size** (optional): The batch size for processing images and text. Defaults to `4`. Example payload: ```json { "image": ["base64_image_string_1", "base64_image_string_2"], "text": ["sample text 1", "sample text 2"], "batch_size": 4 } ``` ### Output Format The handler returns a dictionary with the following keys: - **image**: List of embeddings for each image. - **text**: List of embeddings for each text entry. - **scores**: List of similarity scores between the image and text embeddings. Example output: ```json { "image": [[0.12, 0.34, ...], [0.56, 0.78, ...]], "text": [[0.11, 0.22, ...], [0.33, 0.44, ...]], "scores": [[0.87, 0.45], [0.23, 0.67]] } ``` ### Error Handling If any issues occur during processing (e.g., decoding images or model inference), the handler logs the error and returns an error message in the output dictionary.