Issue with Deploying Fine-Tuned MiniCPM Model on RunPod Serverless
Hi everyone,
I am trying to deploy a fine-tuned version of openbmb/MiniCPM-V-2_6 (the LoRA adapter is on Hugging Face as Zorro123444/invoice_extracter_2) on RunPod Serverless with an NVIDIA RTX A6000 GPU (48GB). The worker appears to load the model, but the process then fails without any error message; the only log output I see is:
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:03<00:11, 3.77s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:07<00:06, 3.48s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:10<00:03, 3.30s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:11<00:00, 2.65s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:11<00:00, 2.95s/it]
Details:
Model Type: openbmb/MiniCPM-V-2_6 from Hugging Face
Adapter Type: Zorro123444/invoice_extracter_2
GPU: NVIDIA RTX A6000 (48GB)
Docker Image: docker.io/nemesis55/mincpm-llama3-v2_5:latest (ignore the naming)
Error: No explicit error message is generated; it just fails silently during the model loading stage.
Code Snippet (Handler):
import base64
import torch
from PIL import Image
import fitz # PyMuPDF for handling PDFs
import pytesseract
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
import runpod
from huggingface_hub import login
import traceback
# Constants
MODEL_DPI = 600
MODEL_TYPE = "openbmb/MiniCPM-V-2_6"
ADAPTOR_TYPE = "Zorro123444/invoice_extracter_2"
CACHE_DIR_MODEL = "./cache_dir/model"
CACHE_DIR_ADAPTOR = "./cache_dir/adaptor"
# Load Model and Tokenizer
def load_model_and_tokenizer():
    """Load the main model and tokenizer."""
    try:
        tokenizer = AutoTokenizer.from_pretrained(MODEL_TYPE, trust_remote_code=True)
        print("Tokenizer loaded.")
        model = AutoModel.from_pretrained(MODEL_TYPE, trust_remote_code=True)
        print("Base Model loaded successfully.")
        lora_model = PeftModel.from_pretrained(
            model,
            ADAPTOR_TYPE,
            device_map="auto",
            trust_remote_code=True,
            cache_dir=CACHE_DIR_MODEL,
        ).cuda().eval()
        print("Model and adapter loaded successfully.")
        return lora_model, tokenizer
    except Exception as e:
        print(f"Error loading model or adapter: {str(e)}")
        print(traceback.format_exc())
# Convert PDF Page to Image
def pdf_to_image(pdf_bytes, dpi=MODEL_DPI):
    """Convert a single-page PDF to an image."""
    try:
        pdf_document = fitz.open(stream=pdf_bytes, filetype="pdf")
        if len(pdf_document) < 1:
            raise ValueError("The PDF does not contain any pages.")
        page = pdf_document.load_page(0)
        pix = page.get_pixmap(dpi=dpi)
        return Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    except Exception as e:
        print(f"Error converting PDF to image: {e}")
        raise ValueError(f"Error converting PDF to image: {e}")
# Extract Text using OCR
def extract_text_from_image(pdf_bytes, dpi=MODEL_DPI):
    """Extract text from an image derived from the PDF."""
    try:
        image = pdf_to_image(pdf_bytes, dpi)
        text = pytesseract.image_to_string(image)
        print(f"Extracted text length: {len(text)} characters.")
        return text
    except Exception as e:
        print(f"Error during text extraction: {e}")
        raise RuntimeError(f"Error during text extraction: {e}")
# Generate Detailed Prompt
def generate_prompt(pdf_bytes, ocr_data):
    """Create the detailed prompt for the model."""
    try:
        image = pdf_to_image(pdf_bytes)
        question = (
"You are an AI model specialized in data extraction from invoices. "
"Below, you are provided with OCR-extracted text from an invoice. "
"Your task is to analyze the OCR data and extract key details to structure them as a JSON object.\n\n"
f"### OCR Data:\n{ocr_data}\n\n"
"### Instructions:\n"
"1. Extract all the required fields as specified in the JSON structure below.\n"
"2. Ensure the output is a syntactically valid JSON string.\n"
"3. If a field is missing or unavailable in the OCR text, set its value to an empty string \"\".\n"
"4. Maintain the exact formatting of numeric values and dates as found in the input.\n"
"5. Do not include additional explanations or comments in your output.\n\n"
"### JSON Structure:\n"
"{\n"
" \"OrderNumber\": \"<string>\",\n"
" \"InvoiceNumber\": \"<string>\",\n"
" \"BuyerName\": \"<string>\",\n"
" \"BuyerAddress1\": \"<string>\",\n"
" \"BuyerZipCode\": \"<string>\",\n"
" \"BuyerCity\": \"<string>\",\n"
" \"BuyerCountry\": \"<string>\",\n"
" \"ReceiverName\": \"<string>\",\n"
" \"ReceiverAddress1\": \"<string>\",\n"
" \"ReceiverZipCode\": \"<string>\",\n"
" \"ReceiverCity\": \"<string>\",\n"
" \"ReceiverCountry\": \"<string>\",\n"
" \"SellerName\": \"<string>\",\n"
" \"NetAmount\": \"<string>\",\n"
" \"OrderDate\": \"<YYYY-MM-DD>\",\n"
" \"Currency\": \"<string>\",\n"
" \"TermsOfDelCode\": \"<string>\",\n"
" \"OrderItems\": [\n"
" {\n"
" \"ArticleNumber\": \"<string>\",\n"
" \"Description\": \"<string>\",\n"
" \"HsCode\": \"<string>\",\n"
" \"CountryOfOrigin\": \"<string>\",\n"
" \"Quantity\": \"<string>\",\n"
" \"NetWeight\": \"<string>\",\n"
" \"NetAmount\": \"<string>\",\n"
" \"PricePerPiece\": \"<string>\",\n"
" \"EclEuNO\": \"<string>\"\n"
" }\n"
" ],\n"
" \"NetWeight\": \"<string>\",\n"
" \"NumberOfUnits\": \"<string>\"\n"
"}\n\n"
"### Note:\n"
"Ensure the JSON structure is returned exactly as shown above, with appropriate values extracted from the OCR data."
        )
        return [{"role": "user", "content": [image, question]}]
    except Exception as e:
        print(f"Error generating prompt: {e}")
        raise RuntimeError(f"Error generating prompt: {e}")
# Handle Inference
def perform_inference(messages, model, tokenizer):
    """Perform model inference."""
    try:
        with torch.no_grad():
            response = model.chat(image=None, msgs=messages, tokenizer=tokenizer, max_new_tokens=4096)
        return response
    except Exception as e:
        print(f"Inference failed: {e}")
        raise RuntimeError(f"Inference failed: {e}")
# Main Request Handler
def run(request):
    """Process incoming requests."""
    try:
        input_data = request.get("input", {})
        pdf_data = input_data.get("pdf_data")
        ocr_data = input_data.get("ocr_data")
        if not pdf_data:
            return {"error": "Missing PDF data."}
        pdf_bytes = base64.b64decode(pdf_data)
        if not ocr_data:
            print("No OCR data provided. Extracting...")
            ocr_data = extract_text_from_image(pdf_bytes)
        prompt = generate_prompt(pdf_bytes, ocr_data)
        response = perform_inference(prompt, model, tokenizer)
        return {"response": response}
    except Exception as e:
        print(f"Exception during processing: {e}")
        return {"error": f"Exception during processing: {e}"}
# Log in with your Hugging Face token
login("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")
model, tokenizer = load_model_and_tokenizer()
# Initialize and Start RunPod Handler
if __name__ == "__main__":
    print("Initializing RunPod serverless handler.")
    runpod.serverless.start({"handler": run})
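For reference, this is roughly how I smoke-test the handler locally before building the image (sample_invoice.pdf is just a placeholder file name, and this snippet is not part of the deployed code):

import base64

# Hypothetical local test: build the same event shape that RunPod would pass to run().
with open("sample_invoice.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode("utf-8")

test_event = {"input": {"pdf_data": pdf_b64}}  # ocr_data omitted so the handler runs Tesseract itself
print(run(test_event))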
Dockerfile:
# Use the RunPod PyTorch image with CUDA as the base image
FROM runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
# Set the working directory in the container
WORKDIR /
# Install system dependencies (needed for some libraries like pytesseract, fitz)
RUN apt-get update && apt-get install -y \
build-essential \
libsm6 \
libxext6 \
libxrender-dev \
tesseract-ocr \
libmagic1 \
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
# Copy the handler.py file into the container
COPY handler.py .
# Expose port for the API (if you run a local server, typically 8000)
EXPOSE 8000
# Run the application
CMD ["python", "handler.py"]
Requirements:
Pillow
torch
torchvision
transformers
sentencepiece
decord
peft
pytesseract
huggingface-hub
runpod
pymupdf
flash-attn
Observations:
The server starts loading the model shards and reaches 100%, but never shows a clear failure or memory-related error afterwards.
I suspect there might be a GPU out-of-memory (OOM) problem given the size of the model and the RTX A6000's 48GB capacity; the sketch right after this list shows the kind of lower-precision load I'm considering.
The model loads fine locally, but on the serverless setup the worker stalls right after the checkpoint-loading log lines.
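If it does turn out to be memory-related, one idea is to load the base model in half precision and let the library place it, before attaching the adapter. This is a rough, untested sketch of what I mean, not what the deployed handler currently does:

import torch
from transformers import AutoModel
from peft import PeftModel

# Idea only: load the base model in bfloat16 so the weights don't materialize in fp32,
# and let device_map="auto" handle placement (this would need the accelerate package,
# which is not in my requirements.txt yet).
base_model = AutoModel.from_pretrained(
    MODEL_TYPE,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
lora_model = PeftModel.from_pretrained(base_model, ADAPTOR_TYPE, trust_remote_code=True).eval()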
Questions:
How can I debug this? Any suggestions on enabling better logging or tracking GPU usage? (The sketch after this list shows the kind of instrumentation I have in mind.)
Could there be a GPU OOM issue? If so, how can I optimize the deployment for larger models like this one?
Any specific configurations I might be missing in the Docker setup or model loading code?
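For the logging question, this is the kind of instrumentation I was thinking of adding around each loading step so the RunPod worker logs show where memory actually goes (rough sketch):

import torch

def log_gpu_memory(tag):
    """Print allocated/reserved CUDA memory so it shows up in the worker logs."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        print(f"[{tag}] GPU memory: {allocated:.2f} GiB allocated, {reserved:.2f} GiB reserved", flush=True)

# For example, around each stage of load_model_and_tokenizer():
# log_gpu_memory("before base model")
# model = AutoModel.from_pretrained(MODEL_TYPE, trust_remote_code=True)
# log_gpu_memory("after base model")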
Looking forward to any insights or suggestions!
Thanks in advance.