Daemontatox's picture
Update README.md
cd41ef9 verified
metadata
base_model: unsloth/Qwen2-VL-7B-Instruct
tags:
  - document-parsing
  - information-extraction
  - transformers
  - unsloth
  - qwen2_vl
license: apache-2.0
language:
  - en

image

VisionParser-VL-Expert

Developed by: Daemontatox

Model Type: Fine-tuned Vision-Language Model (VLM)

Base Model: unsloth/Qwen2-VL-7B-Instruct

Finetuned from model: unsloth/Qwen2-VL-7B-Instruct

License: apache-2.0

Languages: en

Tags:

  • document-parsing
  • information-extraction
  • vision-language
  • unsloth
  • qwen2_vl

Model Description

VisionParser-VL-Expert is a fine-tuned version of unsloth/Qwen2-VL-7B-Instruct, designed specifically for document parsing and extraction tasks. It excels in interpreting and extracting structured data from images of documents, such as invoices, forms, and reports.

The finetuning process utilized QLoRA with Unsloth and the Hugging Face TRL library, enabling efficient training with minimal resource overhead. This model demonstrates significant improvements in:

  • Extracting textual information from visually complex layouts.
  • Recognizing tabular and hierarchical data structures.
  • Generating accurate and contextually rich text outputs for document understanding.

Datasets used include a combination of publicly available document datasets (e.g., FUNSD, DocVQA) and proprietary annotated data for domain-specific applications.

Intended Uses

VisionParser-VL-Expert is intended for:

  • Extracting data from scanned documents, invoices, and forms.
  • Parsing and analyzing structured layouts such as tables and charts.
  • Generating textual summaries of visual content in documents.
  • Supporting OCR systems by providing contextually enriched outputs.

Limitations

While VisionParser-VL-Expert is powerful, it has certain limitations:

  • May struggle with low-quality or heavily distorted images.
  • Biases from training data might influence performance.
  • Limited support for languages other than English.
  • Performance can vary with highly complex or novel document layouts.

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "daemontatox/visionparser-vl-expert"  # Replace with the actual model name

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example usage with text and image
prompt = "Extract key details from the document: "
image_path = "path/to/your/document_image.jpg"  # Replace with your image path

inputs = tokenizer(prompt, images=image_path, return_tensors="pt")
outputs = model.generate(**inputs)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Acknowledgements

Special thanks to the Unsloth team for their robust tools enabling efficient fine-tuning. This model was developed with the help of open-source libraries and community datasets.