gchhablani committed • Commit 3bd4b4e • Parent(s): 7f8d82b
Put FasterRCNN steps in Beta Expander
apps/article.py
CHANGED
@@ -303,7 +303,10 @@ def app(state=None):
     toc.placeholder()
 
     toc.header("Introduction and Motivation")
-    st.write(read_markdown("intro/
+    st.write(read_markdown("intro/intro_part_1.md"))
+    with st.beta_expander("FasterRCNN Approach"):
+        st.write(read_markdown("intro/faster_rcnn_approach.md"))
+    st.write(read_markdown("intro/intro_part_2.md"))
     toc.subheader("Novel Contributions")
     st.write(read_markdown("intro/contributions.md"))
 
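For context, here is a minimal, self-contained sketch of the pattern this diff introduces: the FasterRCNN steps are rendered inside a collapsible expander instead of inline. The `read_markdown` helper below is a hypothetical stand-in for the app's own utility (its implementation is not shown in this diff), and `st.beta_expander` is the pre-1.0 Streamlit name for what later versions call `st.expander`.

```python
# Standalone sketch of the pattern introduced by this diff: render the FasterRCNN
# steps inside a collapsible expander rather than inline.
import streamlit as st

def read_markdown(path: str) -> str:
    # Hypothetical helper: return the raw text of a markdown section file.
    with open(f"sections/{path}", encoding="utf-8") as f:
        return f.read()

st.write(read_markdown("intro/intro_part_1.md"))
with st.beta_expander("FasterRCNN Approach"):  # collapsed by default
    st.write(read_markdown("intro/faster_rcnn_approach.md"))
st.write(read_markdown("intro/intro_part_2.md"))
```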
sections/intro/faster_rcnn_approach.md
ADDED
@@ -0,0 +1,6 @@
+For example, a FasterRCNN approach uses the following steps:
+- the image features are extracted by an FPN (Feature Pyramid Network) over a ResNet backbone,
+- then an RPN (Region Proposal Network) detects proposals in those features,
+- then the ROI (Region of Interest) heads map the box proposals back onto the original image,
+- the boxes are then filtered with NMS (Non-Maximum Suppression),
+- and the features of the selected boxes are used as visual features.
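To make the pipeline above concrete, here is a hedged sketch of extracting region features with torchvision's `fasterrcnn_resnet50_fpn` (a ResNet-50 + FPN backbone with an RPN and ROI heads, NMS in post-processing). This is a generic illustration of the technique, not the exact extractor used by any specific VQA system; the hook captures per-proposal features before the final score-based NMS, so matching them to the kept boxes would additionally require the surviving indices.

```python
# Hedged sketch of the FasterRCNN-style feature pipeline listed above, using
# torchvision's fasterrcnn_resnet50_fpn (ResNet-50 + FPN backbone, RPN, ROI heads).
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()

captured = {}

def save_box_features(module, inputs, output):
    # Pooled per-proposal features from the ROI box head, shape (num_proposals, 1024).
    # Computed before the final NMS, so mapping them to kept detections needs the
    # indices of the boxes that survive NMS.
    captured["box_features"] = output.detach()

model.roi_heads.box_head.register_forward_hook(save_box_features)

image = torch.rand(3, 480, 640)        # dummy RGB image with values in [0, 1]
with torch.no_grad():
    detections = model([image])[0]     # boxes/labels/scores kept after NMS

print(captured["box_features"].shape)  # e.g. torch.Size([1000, 1024])
print(detections["boxes"].shape)       # final boxes after NMS, (num_kept, 4)
```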
sections/intro/intro_part_1.md
ADDED
@@ -0,0 +1,3 @@
+Visual Question Answering (VQA) is a task where we expect an AI system to answer a question about a given image. VQA has been an active area of research for the past 4-5 years, with most datasets using natural images found online. Two examples of such datasets are [VQAv2](https://visualqa.org/challenge.html) and [GQA](https://cs.stanford.edu/people/dorarad/gqa/about.html). VQA is a particularly interesting multi-modal machine learning challenge because it has applications across several domains, including healthcare chatbots and interactive agents. **However, most VQA challenges and datasets deal with English-only captions and questions.**
+
+In addition, even recent **approaches proposed for VQA remain relatively inaccessible**, because CNN-based object detectors are difficult to use and make feature extraction more complex. Click on the expandable region below to see the steps of a FasterRCNN-based approach.
sections/intro/{intro.md → intro_part_2.md}
RENAMED
@@ -1,12 +1,3 @@
-Visual Question Answering (VQA) is a task where we expect the AI to answer a question about a given image. VQA has been an active area of research for the past 4-5 years, with most datasets using natural images found online. Two examples of such datasets are: [VQAv2](https://visualqa.org/challenge.html), [GQA](https://cs.stanford.edu/people/dorarad/gqa/about.html). VQA is a particularly interesting multi-modal machine learning challenge because it has several interesting applications across several domains including healthcare chatbots, interactive-agents, etc. **However, most VQA challenges or datasets deal with English-only captions and questions.**
-
-In addition, even recent **approaches that have been proposed for VQA generally are obscure** due to the fact that CNN-based object detectors are relatively difficult to use and more complex for feature extraction. For example, a FasterRCNN approach uses the following steps:
-- the image features are given out by a FPN (Feature Pyramid Net) over a ResNet backbone, and
-- then a RPN (Regision Proposal Network) layer detects proposals in those features, and
-- then the ROI (Region of Interest) heads get the box proposals in the original image, and
-- the the boxes are selected using a NMS (Non-max suppression),
-- and then the features for selected boxes are used as visual features.
-
 A major **advantage that comes from using transformers is their simplicity and their accessibility** - thanks to HuggingFace, ViT and Transformers. For ViT models, for example, all one needs to do is pass the normalized images to the transformer.
 
 While building a low-resource non-English VQA approach has several benefits of its own, a multilingual VQA task is interesting because it will help create a generic model that works well across several languages. It can then be fine-tuned in low-resource settings to leverage pre-training improvements. **With the aim of democratizing such a challenging yet interesting task, in this project we focus on Multilingual Visual Question Answering (MVQA)**. Our intention here is to provide a Proof-of-Concept with our simple CLIP-Vision-BERT baseline, which leverages a multilingual checkpoint with pre-trained image encoders. Our model currently supports four languages - **English, French, German and Spanish**.
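As a hedged illustration of the "simplicity and accessibility" point kept in the file above, here is a minimal sketch of extracting visual features with a ViT model via HuggingFace Transformers. The checkpoint name is an example choice, not necessarily the one used in this project; the feature extractor handles resizing and normalization, so the image really is just passed to the transformer (newer versions of the library call this class `ViTImageProcessor`).

```python
# Minimal sketch: ViT feature extraction with HuggingFace Transformers.
# The checkpoint is an example choice; the feature extractor resizes and
# normalizes the image, and the model returns one embedding per image patch.
import torch
from PIL import Image
from transformers import ViTFeatureExtractor, ViTModel

checkpoint = "google/vit-base-patch16-224-in21k"  # example checkpoint
feature_extractor = ViTFeatureExtractor.from_pretrained(checkpoint)
model = ViTModel.from_pretrained(checkpoint)

image = Image.new("RGB", (640, 480), color=(128, 128, 128))  # dummy image stand-in
inputs = feature_extractor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# (1, 197, 768): a [CLS] token plus 196 patch embeddings for a 224x224 input
print(outputs.last_hidden_state.shape)
```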