MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation
Abstract
Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting the objects in an image described by a free-form text expression. While previous studies have focused on aligning visual and language features, training techniques such as data augmentation remain underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that conventional image augmentations fall short in RIS, leading to performance degradation, while simple random masking significantly improves RIS performance. MaskRIS applies both image and text masking, followed by Distortion-aware Contextual Learning (DCL), to fully exploit the benefits of the masking strategy. This approach can improve the model's robustness to occlusions, incomplete information, and various linguistic complexities, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can be easily applied to various RIS models, outperforming existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Code is available at https://github.com/naver-ai/maskris.
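To make the masking idea in the abstract concrete, here is a minimal sketch of random image-patch masking and random word masking in plain Python. It is not the authors' implementation: the function names, the patch size, the masking ratios, the zero fill value, and the `[MASK]` placeholder are all assumptions chosen for illustration, and Distortion-aware Contextual Learning is not shown because the abstract does not describe how it works.

```python
import random


def mask_image_patches(image, patch_size, mask_ratio, rng, fill=0):
    """Return a copy of `image` (a list of pixel rows) with random patches set to `fill`.

    Hypothetical helper: the grid layout and fill value are illustrative assumptions.
    """
    height, width = len(image), len(image[0])
    rows, cols = height // patch_size, width // patch_size
    patches = [(r, c) for r in range(rows) for c in range(cols)]
    num_masked = int(round(mask_ratio * len(patches)))
    masked = [row[:] for row in image]  # copy so the input stays untouched
    for r, c in rng.sample(patches, num_masked):
        for y in range(r * patch_size, (r + 1) * patch_size):
            for x in range(c * patch_size, (c + 1) * patch_size):
                masked[y][x] = fill
    return masked


def mask_text_tokens(tokens, mask_ratio, rng, mask_token="[MASK]"):
    """Return a copy of `tokens` with a random subset replaced by `mask_token`.

    Hypothetical helper: the placeholder token is an illustrative assumption.
    """
    num_masked = int(round(mask_ratio * len(tokens)))
    chosen = set(rng.sample(range(len(tokens)), num_masked))
    return [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
image = [[1] * 8 for _ in range(8)]  # toy 8x8 single-channel image
expression = "the man in the red shirt".split()

masked_image = mask_image_patches(image, patch_size=2, mask_ratio=0.5, rng=rng)
masked_expression = mask_text_tokens(expression, mask_ratio=0.5, rng=rng)
```

In a training loop, the masked image and masked expression would be fed to the RIS model alongside, or instead of, the original pair; how the two views are combined is left open here.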
Community
We investigate data augmentation for Referring Image Segmentation (RIS) and propose a novel learning framework, MaskRIS, based on random masking. Unlike conventional augmentation, MaskRIS brings remarkable performance gains, surpassing previous state-of-the-art results on the RIS benchmarks.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Masked Image Modeling Boosting Semi-Supervised Semantic Segmentation (2024)
- OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling (2024)
- XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation (2024)
- CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation (2024)
- Text4Seg: Reimagining Image Segmentation as Text Generation (2024)
- ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements (2024)