File size: 6,135 Bytes
2022348
7c978fb
2022348
 
 
 
7c978fb
328d0b3
 
 
2022348
 
 
328d0b3
7785bdf
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2022348
 
328d0b3
 
5377bb4
2022348
5377bb4
 
 
2022348
328d0b3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2022348
328d0b3
5377bb4
328d0b3
 
 
 
 
 
 
 
 
 
 
 
 
5377bb4
 
328d0b3
 
 
7785bdf
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
---
base_model: unsloth/Llama-3.2-11B-Vision-Instruct
tags:
- text-generation-inference
- transformers
- unsloth
- mllama
- vision-language
- document-understanding
- data-extraction
license: apache-2.0
language:
- en
library_name: transformers
model-index:
- name: PixelParse_AI
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: wis-k/instruction-following-eval
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 43.83
      name: averaged accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FPixelParse_AI
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: SaylorTwift/bbh
      split: test
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 29.03
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FPixelParse_AI
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: lighteval/MATH-Hard
      split: test
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 14.43
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FPixelParse_AI
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 9.84
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FPixelParse_AI
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 9.25
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FPixelParse_AI
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 30.87
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FPixelParse_AI
      name: Open LLM Leaderboard
---


![image](./image.webp)
# Vision-Language Model for Document Data Extraction  

- **Developed by:** Daemontatox  
- **License:** apache-2.0  
- **Finetuned from model:** unsloth/Llama-3.2-11B-Vision-Instruct  

## Overview  

This Vision-Language Model (VLM) is purpose-built for extracting structured and unstructured data from various types of documents, including but not limited to:  
- Invoices  
- Timesheets  
- Contracts  
- Forms  
- Receipts  

By utilizing advanced multimodal learning capabilities, this model understands both text and visual layout features, enabling it to parse even complex document structures.  

## Key Features  

1. **Accurate Data Extraction:**  
   - Automatically detects and extracts key fields such as dates, names, amounts, itemized details, and more.  
   - Outputs data in clean and well-structured JSON format.  

2. **Robust Multimodal Understanding:**  
   - Processes both text and visual layout elements (tables, headers, footers).  
   - Adapts to various document formats and layouts without additional fine-tuning.  

3. **Optimized Performance:**  
   - Fine-tuned using [Unsloth](https://github.com/unslothai/unsloth), enabling 2x faster training.  
   - Employs Hugging Face’s TRL library for parameter-efficient fine-tuning.  

4. **Flexible Deployment:**  
   - Compatible with a wide range of platforms for integration into document processing pipelines.  
   - Optimized for inference on GPUs and high-performance environments.  

## Use Cases  

- **Enterprise Automation:** Automate data entry and document processing tasks in finance, HR, and legal domains.  
- **E-invoicing:** Extract critical invoice details for seamless integration with ERP systems.  
- **Compliance:** Extract and structure data for auditing and regulatory compliance reporting.  

## Training and Fine-Tuning  

The fine-tuning process leveraged Unsloth's efficiency optimizations, reducing training time while maintaining high accuracy. The model was trained on a diverse dataset of scanned documents and synthetic examples to ensure robustness across real-world scenarios.  



## Acknowledgments  

This model was fine-tuned using the powerful capabilities of the [Unsloth](https://github.com/unslothai/unsloth) framework, which significantly accelerates the training of large models.  

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)  

---  

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/Daemontatox__PixelParse_AI-details)!
Summarized results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/contents/viewer/default/train?q=Daemontatox/PixelParse_AI)!

|      Metric       |Value (%)|
|-------------------|--------:|
|**Average**        |    22.87|
|IFEval (0-Shot)    |    43.83|
|BBH (3-Shot)       |    29.03|
|MATH Lvl 5 (4-Shot)|    14.43|
|GPQA (0-shot)      |     9.84|
|MuSR (0-shot)      |     9.25|
|MMLU-PRO (5-shot)  |    30.87|