Papers
arxiv:2407.09033

Textual Query-Driven Mask Transformer for Domain Generalized Segmentation

Published on Jul 12, 2024
Authors:
,
,
,
,

Abstract

In this paper, we introduce a method to tackle Domain Generalized Semantic Segmentation (DGSS) by utilizing domain-invariant semantic knowledge from text embeddings of vision-language models. We employ the text embeddings as object queries within a transformer-based segmentation framework (textual object queries). These queries are regarded as a domain-invariant basis for pixel grouping in DGSS. To leverage the power of textual object queries, we introduce a novel framework named the textual query-driven mask transformer (tqdm). Our tqdm aims to (1) generate textual object queries that maximally encode domain-invariant semantics and (2) enhance the semantic clarity of dense visual features. Additionally, we suggest three regularization losses to improve the efficacy of tqdm by aligning between visual and textual features. By utilizing our method, the model can comprehend inherent semantic information for classes of interest, enabling it to generalize to extreme domains (e.g., sketch style). Our tqdm achieves 68.9 mIoU on GTA5rightarrowCityscapes, outperforming the prior state-of-the-art method by 2.5 mIoU. The project page is available at https://byeonghyunpak.github.io/tqdm.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2407.09033 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2407.09033 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2407.09033 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.