updating the number of models and using a new pretrained model for finetuning a BLIP
Files changed:
- README.md +8 -4
- app.py +1 -1
- inference.py +1 -1
README.md
CHANGED
@@ -8,17 +8,21 @@ widget:
   src: "617.jpg"
 ---
 
-# This is a simple VQA system using Hugging Face, PyTorch and
+# This is a simple VQA system using Hugging Face, PyTorch and VQA models
 -------------
 
 In this repository we created a simple VQA system capable of recognizing spatial and contextual information in fashion images (e.g. clothes color and details).
 
-The project was based in this paper **FashionVQA: A Domain-Specific Visual Question Answering System** [[1]](#1).
-
+The project was based on the paper **FashionVQA: A Domain-Specific Visual Question Answering System** [[1]](#1). We also used the pre-trained VQA model from **BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation** [[2]](#2) to fine-tune the two new models.
 
+We used the **Deep Fashion with Masks** dataset, available at <https://huggingface.co/datasets/SaffalPoosh/deepFashion-with-masks>, and the **ControlNet** dataset, available at <https://huggingface.co/datasets/ldhnam/deepfashion_controlnet>.
 
 
 ## References
 <a id="1">[1]</a>
 Min Wang and Ata Mahjoubfar and Anupama Joshi, 2022
-FashionVQA: A Domain-Specific Visual Question Answering System
+FashionVQA: A Domain-Specific Visual Question Answering System
+
+<a id="2">[2]</a>
+Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi, 2022
+BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
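
The commit message and the new README paragraph say the two models were fine-tuned from the BLIP VQA pre-trained checkpoint on the two fashion datasets, but the training code is not part of this diff. Below is a minimal sketch of loading a BLIP VQA checkpoint and one of the referenced datasets, assuming the public `Salesforce/blip-vqa-base` weights and an `image` column in the dataset (both are assumptions, not taken from this repository):

```python
# Sketch only: load a BLIP VQA checkpoint and a fashion dataset as a starting
# point for fine-tuning. Checkpoint name and dataset column names are assumptions.
from datasets import load_dataset
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# One of the datasets referenced in the README; split and columns may differ.
dataset = load_dataset("SaffalPoosh/deepFashion-with-masks", split="train")

sample = dataset[0]
inputs = processor(
    images=sample["image"],            # assumed column name
    text="What color is the dress?",   # example fashion question
    return_tensors="pt",
)
# At inference time the (fine-tuned) model generates a short textual answer.
answer_ids = model.generate(**inputs)
print(processor.decode(answer_ids[0], skip_special_tokens=True))
```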
app.py
CHANGED
@@ -7,7 +7,7 @@ inference = Inference()
 
 
 with gr.Blocks() as block:
-    options = gr.Dropdown(choices=["Model 1", "Model 2"
+    options = gr.Dropdown(choices=["Model 1", "Model 2"], label="Models", info="Select the model to use..")
     # need to improve this one...
 
     txt = gr.Textbox(label="Insert a question..", lines=2)
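
The diff only completes the `gr.Dropdown` call; how the selected model, the question, and the image reach the model is not visible here. A minimal sketch of one way such a Blocks app could be wired, where the `answer` callback, the image input, and the button are assumptions rather than code from this Space:

```python
# Sketch only: one possible wiring for the dropdown shown in the diff.
# The answer() callback, image input, and button are assumptions.
import gradio as gr

def answer(model_name: str, question: str, image) -> str:
    # The real app would dispatch to inference.Inference(); this stub just
    # echoes the inputs so the sketch stays self-contained and runnable.
    return f"[{model_name}] would answer: '{question}'"

with gr.Blocks() as block:
    options = gr.Dropdown(choices=["Model 1", "Model 2"], label="Models",
                          info="Select the model to use..")
    image = gr.Image(type="pil", label="Fashion image")
    txt = gr.Textbox(label="Insert a question..", lines=2)
    out = gr.Textbox(label="Answer")
    gr.Button("Ask").click(fn=answer, inputs=[options, txt, image], outputs=out)

if __name__ == "__main__":
    block.launch()
```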
inference.py
CHANGED
@@ -1,4 +1,4 @@
-from transformers import ViltProcessor, ViltForQuestionAnswering, Pix2StructProcessor, Pix2StructForConditionalGeneration
+from transformers import ViltProcessor, ViltForQuestionAnswering, Pix2StructProcessor, Pix2StructForConditionalGeneration
 from transformers.utils import logging
 
 class Inference:
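
The imports suggest `Inference` wraps a ViLT model and a Pix2Struct model behind the two dropdown options, but the class body is outside this hunk. A minimal sketch of such a wrapper, using the public `dandelin/vilt-b32-finetuned-vqa` and `google/pix2struct-docvqa-base` checkpoints as stand-ins (the Space presumably loads its own fine-tuned weights, so both names are assumptions):

```python
# Sketch only: a possible Inference wrapper for the two imported model families.
# Checkpoint names are placeholders, not the weights used by this Space.
from transformers import (
    ViltProcessor, ViltForQuestionAnswering,
    Pix2StructProcessor, Pix2StructForConditionalGeneration,
)

class Inference:
    def __init__(self):
        # "Model 1": classification-style VQA with ViLT.
        self.vilt_processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
        self.vilt_model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
        # "Model 2": generative VQA with Pix2Struct.
        self.p2s_processor = Pix2StructProcessor.from_pretrained("google/pix2struct-docvqa-base")
        self.p2s_model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-docvqa-base")

    def answer(self, model_name, image, question):
        if model_name == "Model 1":
            # ViLT treats VQA as classification over a fixed answer vocabulary.
            encoding = self.vilt_processor(image, question, return_tensors="pt")
            logits = self.vilt_model(**encoding).logits
            return self.vilt_model.config.id2label[logits.argmax(-1).item()]
        # Pix2Struct generates the answer as free-form text.
        inputs = self.p2s_processor(images=image, text=question, return_tensors="pt")
        generated = self.p2s_model.generate(**inputs)
        return self.p2s_processor.decode(generated[0], skip_special_tokens=True)
```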