---
language:
- en
- fr
- ro
- de
- multilingual
pipeline_tag: image-to-text
tags:
- image-captioning
license: apache-2.0
---
# Model card for MatCha - fine-tuned on Chart2text-pew

![pull_figure](https://s3.amazonaws.com/moonup/production/uploads/62441d1d9fdefb55a0b7d12c/RFZQUbNbtO8jDPdlTPYHn.png)

This model is the MatCha model, fine-tuned on the Chart2text-pew dataset.

# Table of Contents

0. [TL;DR](#tldr)
1. [Using the model](#using-the-model)
2. [Contribution](#contribution)
3. [Citation](#citation)

# TL;DR

The abstract of the paper states that:
> Visual language data such as plots, charts, and infographics are ubiquitous in the human world. However, state-of-the-art vision-language models do not perform well on these data. We propose MATCHA (Math reasoning and Chart derendering pretraining) to enhance visual language models’ capabilities in jointly modeling charts/plots and language data. Specifically, we propose several pretraining tasks that cover plot deconstruction and numerical reasoning, which are the key capabilities in visual language modeling. We perform the MATCHA pretraining starting from Pix2Struct, a recently proposed image-to-text visual language model. On standard benchmarks such as PlotQA and ChartQA, the MATCHA model outperforms state-of-the-art methods by as much as nearly 20%. We also examine how well MATCHA pretraining transfers to domains such as screenshots, textbook diagrams, and document figures and observe overall improvement, verifying the usefulness of MATCHA pretraining on broader visual language tasks.

# Using the model

## Converting from T5x to Hugging Face

You can use the [`convert_pix2struct_original_pytorch_to_hf.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/pix2struct/convert_pix2struct_original_pytorch_to_hf.py) script as follows:
```bash
python convert_pix2struct_original_pytorch_to_hf.py --t5x_checkpoint_path PATH_TO_T5X_CHECKPOINTS --pytorch_dump_path PATH_TO_SAVE --is_vqa
```
If you are converting a large model, run:
```bash
python convert_pix2struct_original_pytorch_to_hf.py --t5x_checkpoint_path PATH_TO_T5X_CHECKPOINTS --pytorch_dump_path PATH_TO_SAVE --use-large --is_vqa
```
Once saved, you can push your converted model with the following snippet:
```python
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Load the converted checkpoint and processor from the local folder
model = Pix2StructForConditionalGeneration.from_pretrained(PATH_TO_SAVE)
processor = Pix2StructProcessor.from_pretrained(PATH_TO_SAVE)

# Upload both to your namespace on the Hugging Face Hub
model.push_to_hub("USERNAME/MODEL_NAME")
processor.push_to_hub("USERNAME/MODEL_NAME")
```
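
To double-check the upload, a minimal sketch like the following (assuming the same `USERNAME/MODEL_NAME` repository used above) reloads both objects straight from the Hub:
```python
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Reload the model and processor from the Hub to verify the upload
# ("USERNAME/MODEL_NAME" is the repository used with push_to_hub above)
model = Pix2StructForConditionalGeneration.from_pretrained("USERNAME/MODEL_NAME")
processor = Pix2StructProcessor.from_pretrained("USERNAME/MODEL_NAME")
```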

## Run a prediction

You can run a prediction by querying an input chart image as follows:
```python
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor
import requests
from PIL import Image

# Load the fine-tuned checkpoint and its processor
model = Pix2StructForConditionalGeneration.from_pretrained('google/matcha-chart2text-pew')
processor = Pix2StructProcessor.from_pretrained('google/matcha-chart2text-pew')

# Download an example chart image
url = "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/5090.png"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess the image and generate the chart description
inputs = processor(images=image, return_tensors="pt")
predictions = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(predictions[0], skip_special_tokens=True))
```
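
For larger workloads, a minimal sketch along the following lines (assuming a CUDA-capable GPU and the same `google/matcha-chart2text-pew` checkpoint) batches several images together and runs generation on the GPU:
```python
import torch
import requests
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

model = Pix2StructForConditionalGeneration.from_pretrained("google/matcha-chart2text-pew").to(device)
processor = Pix2StructProcessor.from_pretrained("google/matcha-chart2text-pew")

# The processor accepts a list of images; the same example image is used
# twice here purely to illustrate batching.
url = "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/5090.png"
image = Image.open(requests.get(url, stream=True).raw)
images = [image, image]

inputs = processor(images=images, return_tensors="pt").to(device)
predictions = model.generate(**inputs, max_new_tokens=512)

# One decoded description per image in the batch
for description in processor.batch_decode(predictions, skip_special_tokens=True):
    print(description)
```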

# Contribution

This model was originally contributed by Fangyu Liu, Francesco Piccinno et al. and added to the Hugging Face ecosystem by [Younes Belkada](https://huggingface.co/ybelkada).

# Citation

If you want to cite this work, please consider citing the original paper:
```bibtex
@misc{liu2022matcha,
  title={MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering},
  author={Fangyu Liu and Francesco Piccinno and Syrine Krichene and Chenxi Pang and Kenton Lee and Mandar Joshi and Yasemin Altun and Nigel Collier and Julian Martin Eisenschlos},
  year={2022},
  eprint={2212.09662},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```