Bachmann Roman Christian commited on
Commit
3b49518
·
1 Parent(s): bb10473

Initial commit

Browse files
This view is limited to 50 files because it contains too many changes.   See raw diff
Files changed (50) hide show
  1. .gitignore +1 -0
  2. FINETUNING.md +126 -0
  3. LICENSE +399 -0
  4. app.py +405 -0
  5. dpt/__init__.py +0 -0
  6. dpt/base_model.py +16 -0
  7. dpt/blocks.py +383 -0
  8. dpt/midas_net.py +77 -0
  9. dpt/models.py +153 -0
  10. dpt/transforms.py +231 -0
  11. dpt/vit.py +576 -0
  12. mask2former/__init__.py +26 -0
  13. mask2former/config.py +114 -0
  14. mask2former/configs/ade20k/instance-segmentation/Base-ADE20K-InstanceSegmentation.yaml +61 -0
  15. mask2former/configs/ade20k/instance-segmentation/maskformer2_R50_bs16_160k.yaml +44 -0
  16. mask2former/configs/ade20k/instance-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_160k.yaml +18 -0
  17. mask2former/configs/ade20k/panoptic-segmentation/Base-ADE20K-PanopticSegmentation.yaml +61 -0
  18. mask2former/configs/ade20k/panoptic-segmentation/maskformer2_R50_bs16_160k.yaml +44 -0
  19. mask2former/configs/ade20k/panoptic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_160k.yaml +18 -0
  20. mask2former/configs/ade20k/semantic-segmentation/Base-ADE20K-SemanticSegmentation.yaml +61 -0
  21. mask2former/configs/ade20k/semantic-segmentation/maskformer2_R101_bs16_90k.yaml +11 -0
  22. mask2former/configs/ade20k/semantic-segmentation/maskformer2_R50_bs16_160k.yaml +44 -0
  23. mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_base_384_bs16_160k_res640.yaml +37 -0
  24. mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_base_IN21k_384_bs16_160k_res640.yaml +37 -0
  25. mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_160k_res640.yaml +37 -0
  26. mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_small_bs16_160k.yaml +15 -0
  27. mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_tiny_bs16_160k.yaml +15 -0
  28. mask2former/configs/cityscapes/instance-segmentation/Base-Cityscapes-InstanceSegmentation.yaml +61 -0
  29. mask2former/configs/cityscapes/instance-segmentation/maskformer2_R101_bs16_90k.yaml +11 -0
  30. mask2former/configs/cityscapes/instance-segmentation/maskformer2_R50_bs16_90k.yaml +44 -0
  31. mask2former/configs/cityscapes/instance-segmentation/swin/maskformer2_swin_base_IN21k_384_bs16_90k.yaml +16 -0
  32. mask2former/configs/cityscapes/instance-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_90k.yaml +18 -0
  33. mask2former/configs/cityscapes/instance-segmentation/swin/maskformer2_swin_small_bs16_90k.yaml +15 -0
  34. mask2former/configs/cityscapes/instance-segmentation/swin/maskformer2_swin_tiny_bs16_90k.yaml +15 -0
  35. mask2former/configs/cityscapes/panoptic-segmentation/Base-Cityscapes-PanopticSegmentation.yaml +61 -0
  36. mask2former/configs/cityscapes/panoptic-segmentation/maskformer2_R101_bs16_90k.yaml +11 -0
  37. mask2former/configs/cityscapes/panoptic-segmentation/maskformer2_R50_bs16_90k.yaml +44 -0
  38. mask2former/configs/cityscapes/panoptic-segmentation/swin/maskformer2_swin_base_IN21k_384_bs16_90k.yaml +16 -0
  39. mask2former/configs/cityscapes/panoptic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_90k.yaml +18 -0
  40. mask2former/configs/cityscapes/panoptic-segmentation/swin/maskformer2_swin_small_bs16_90k.yaml +15 -0
  41. mask2former/configs/cityscapes/panoptic-segmentation/swin/maskformer2_swin_tiny_bs16_90k.yaml +15 -0
  42. mask2former/configs/cityscapes/semantic-segmentation/Base-Cityscapes-SemanticSegmentation.yaml +61 -0
  43. mask2former/configs/cityscapes/semantic-segmentation/maskformer2_R101_bs16_90k.yaml +11 -0
  44. mask2former/configs/cityscapes/semantic-segmentation/maskformer2_R50_bs16_90k.yaml +44 -0
  45. mask2former/configs/cityscapes/semantic-segmentation/swin/maskformer2_swin_base_IN21k_384_bs16_90k.yaml +16 -0
  46. mask2former/configs/cityscapes/semantic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_90k.yaml +18 -0
  47. mask2former/configs/cityscapes/semantic-segmentation/swin/maskformer2_swin_small_bs16_90k.yaml +15 -0
  48. mask2former/configs/cityscapes/semantic-segmentation/swin/maskformer2_swin_tiny_bs16_90k.yaml +15 -0
  49. mask2former/configs/coco/instance-segmentation/Base-COCO-InstanceSegmentation.yaml +47 -0
  50. mask2former/configs/coco/instance-segmentation/maskformer2_R101_bs16_50ep.yaml +11 -0
.gitignore ADDED
@@ -0,0 +1 @@
 
 
1
+ .DS_Store
FINETUNING.md ADDED
@@ -0,0 +1,126 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Fine-tuning
2
+
3
+ We provide fine-tuning scripts for classification, semantic segmentation, depth estimation and more.
4
+ Please check [SETUP.md](SETUP.md) for set-up instructions first.
5
+
6
+ - [General information](#general-information)
7
+ - [Classification](#classification)
8
+ - [Semantic segmentation](#semantic-segmentation)
9
+ - [Depth estimation](#depth-estimation)
10
+ - [Taskonomy tasks](#taskonomy-tasks)
11
+
12
+ ## General information
13
+
14
+ ### Loading pre-trained models
15
+
16
+ All our fine-tuning scripts support models in the MultiMAE / MultiViT format. Pre-trained models using the timm / ViT format can be converted to this format using the [`vit2multimae_converter.py`](tools/vit2multimae_converter.py)
17
+ script. More information can be found [here](README.md#model-formats).
18
+
19
+ ### Modifying configs
20
+ The training scripts support both YAML config files and command-line arguments. See [here](cfgs/finetune) for all fine-tuning config files.
21
+
22
+ To modify fine-training settings, either edit / add config files or provide additional command-line arguments.
23
+
24
+ :information_source: Config files arguments override default arguments, and command-line arguments override both default arguments and config arguments.
25
+
26
+ :warning: When changing settings (e.g., using a different pre-trained model), make sure to modify the `output_dir` and `wandb_run_name` (if logging is activated) to reflect the changes.
27
+
28
+
29
+ ### Experiment logging
30
+ To activate logging to [Weights & Biases](https://docs.wandb.ai/), either edit the config files or use the `--log_wandb` flag along with any other extra logging arguments.
31
+
32
+
33
+ ## Classification
34
+
35
+ We use 8 A100 GPUs for classification fine-tuning. Configs can be found [here](cfgs/finetune/cls).
36
+
37
+ To fine-tune MultiMAE on ImageNet-1K classification using default settings, run:
38
+ ```bash
39
+ OMP_NUM_THREADS=1 torchrun --nproc_per_node=8 run_finetuning_cls.py \
40
+ --config cfgs/finetune/cls/ft_in1k_100e_multimae-b.yaml \
41
+ --finetune /path/to/multimae_weights \
42
+ --data_path /path/to/in1k/train/rgb \
43
+ --eval_data_path /path/to/in1k/val/rgb
44
+ ```
45
+
46
+ - For a list of possible arguments, see [`run_finetuning_cls.py`](run_finetuning_cls.py).
47
+
48
+ ## Semantic segmentation
49
+
50
+ We use 4 A100 GPUs for semantic segmentation fine-tuning. Configs can be found [here](cfgs/finetune/semseg).
51
+
52
+ ### ADE20K
53
+ To fine-tune MultiMAE on ADE20K semantic segmentation with default settings and **RGB** as the input modality, run:
54
+ ```bash
55
+ OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 run_finetuning_semseg.py \
56
+ --config cfgs/finetune/semseg/ade/ft_ade_64e_multimae-b_rgb.yaml \
57
+ --finetune /path/to/multimae_weights \
58
+ --data_path /path/to/ade20k/train \
59
+ --eval_data_path /path/to/ade20k/val
60
+ ```
61
+
62
+ - For a list of possible arguments, see [`run_finetuning_semseg.py`](run_finetuning_semseg.py).
63
+
64
+
65
+ ### Hypersim
66
+ To fine-tune MultiMAE on Hypersim semantic segmentation with default settings and **RGB** as the input modality, run:
67
+ ```bash
68
+ OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 run_finetuning_semseg.py \
69
+ --config cfgs/finetune/semseg/hypersim/ft_hypersim_25e_multimae-b_rgb.yaml \
70
+ --finetune /path/to/multimae_weights \
71
+ --data_path /path/to/hypersim/train \
72
+ --eval_data_path /path/to/hypersim/val
73
+ ```
74
+
75
+ - To fine-tune using **depth-only** and **RGB + depth** as the input modalities, simply swap the config file to the appropriate one.
76
+ - For a list of possible arguments, see [`run_finetuning_semseg.py`](run_finetuning_semseg.py).
77
+
78
+
79
+
80
+ ### NYUv2
81
+ To fine-tune MultiMAE on NYUv2 semantic segmentation with default settings and **RGB** as the input modality, run:
82
+ ```bash
83
+ OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 run_finetuning_semseg.py \
84
+ --config cfgs/finetune/semseg/nyu/ft_nyu_200e_multimae-b_rgb.yaml \
85
+ --finetune /path/to/multimae_weights \
86
+ --data_path /path/to/nyu/train \
87
+ --eval_data_path /path/to/nyu/test_or_val
88
+ ```
89
+
90
+ - To fine-tune using **depth-only** and **RGB + depth** as the input modalities, simply swap the config file to the appropriate one.
91
+ - For a list of possible arguments, see [`run_finetuning_semseg.py`](run_finetuning_semseg.py).
92
+
93
+
94
+ ## Depth estimation
95
+
96
+ We use 2 A100 GPUs for depth estimation fine-tuning. Configs can be found [here](cfgs/finetune/depth).
97
+
98
+
99
+ To fine-tune MultiMAE on NYUv2 depth estimation with default settings, run:
100
+ ```bash
101
+ OMP_NUM_THREADS=1 torchrun --nproc_per_node=2 run_finetuning_depth.py \
102
+ --config cfgs/finetune/depth/ft_nyu_2000e_multimae-b.yaml \
103
+ --finetune /path/to/multimae_weights \
104
+ --data_path /path/to/nyu/train \
105
+ --eval_data_path /path/to/nyu/test_or_val
106
+ ```
107
+ - For a list of possible arguments, see [`run_finetuning_depth.py`](run_finetuning_depth.py).
108
+
109
+ ## Taskonomy tasks
110
+
111
+ We use 1 A100 GPU to fine-tune on Taskonomy tasks. Configs can be found [here](cfgs/finetune/taskonomy).
112
+
113
+ The tasks we support are: Principal curvature, z-buffer depth, texture edges, occlusion edges, 2D keypoints,
114
+ 3D keypoints, surface normals, and reshading.
115
+
116
+
117
+ For example, to fine-tune MultiMAE on Taskonomy reshading with default settings, run:
118
+ ```bash
119
+ OMP_NUM_THREADS=1 torchrun --nproc_per_node=1 run_finetuning_taskonomy.py \
120
+ --config cfgs/finetune/taskonomy/rgb2reshading-1k/ft_rgb2reshading_multimae-b.yaml \
121
+ --finetune /path/to/multimae_weights \
122
+ --data_path /path/to/taskonomy_tiny
123
+ ```
124
+
125
+ - To fine-tune on a different task, simply swap the config file to the appropriate one.
126
+ - For a list of possible arguments, see [`run_finetuning_taskonomy.py`](run_finetuning_taskonomy.py).
LICENSE ADDED
@@ -0,0 +1,399 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Attribution-NonCommercial 4.0 International
2
+
3
+ =======================================================================
4
+
5
+ Creative Commons Corporation ("Creative Commons") is not a law firm and
6
+ does not provide legal services or legal advice. Distribution of
7
+ Creative Commons public licenses does not create a lawyer-client or
8
+ other relationship. Creative Commons makes its licenses and related
9
+ information available on an "as-is" basis. Creative Commons gives no
10
+ warranties regarding its licenses, any material licensed under their
11
+ terms and conditions, or any related information. Creative Commons
12
+ disclaims all liability for damages resulting from their use to the
13
+ fullest extent possible.
14
+
15
+ Using Creative Commons Public Licenses
16
+
17
+ Creative Commons public licenses provide a standard set of terms and
18
+ conditions that creators and other rights holders may use to share
19
+ original works of authorship and other material subject to copyright
20
+ and certain other rights specified in the public license below. The
21
+ following considerations are for informational purposes only, are not
22
+ exhaustive, and do not form part of our licenses.
23
+
24
+ Considerations for licensors: Our public licenses are
25
+ intended for use by those authorized to give the public
26
+ permission to use material in ways otherwise restricted by
27
+ copyright and certain other rights. Our licenses are
28
+ irrevocable. Licensors should read and understand the terms
29
+ and conditions of the license they choose before applying it.
30
+ Licensors should also secure all rights necessary before
31
+ applying our licenses so that the public can reuse the
32
+ material as expected. Licensors should clearly mark any
33
+ material not subject to the license. This includes other CC-
34
+ licensed material, or material used under an exception or
35
+ limitation to copyright. More considerations for licensors:
36
+ wiki.creativecommons.org/Considerations_for_licensors
37
+
38
+ Considerations for the public: By using one of our public
39
+ licenses, a licensor grants the public permission to use the
40
+ licensed material under specified terms and conditions. If
41
+ the licensor's permission is not necessary for any reason--for
42
+ example, because of any applicable exception or limitation to
43
+ copyright--then that use is not regulated by the license. Our
44
+ licenses grant only permissions under copyright and certain
45
+ other rights that a licensor has authority to grant. Use of
46
+ the licensed material may still be restricted for other
47
+ reasons, including because others have copyright or other
48
+ rights in the material. A licensor may make special requests,
49
+ such as asking that all changes be marked or described.
50
+ Although not required by our licenses, you are encouraged to
51
+ respect those requests where reasonable. More_considerations
52
+ for the public:
53
+ wiki.creativecommons.org/Considerations_for_licensees
54
+
55
+ =======================================================================
56
+
57
+ Creative Commons Attribution-NonCommercial 4.0 International Public
58
+ License
59
+
60
+ By exercising the Licensed Rights (defined below), You accept and agree
61
+ to be bound by the terms and conditions of this Creative Commons
62
+ Attribution-NonCommercial 4.0 International Public License ("Public
63
+ License"). To the extent this Public License may be interpreted as a
64
+ contract, You are granted the Licensed Rights in consideration of Your
65
+ acceptance of these terms and conditions, and the Licensor grants You
66
+ such rights in consideration of benefits the Licensor receives from
67
+ making the Licensed Material available under these terms and
68
+ conditions.
69
+
70
+ Section 1 -- Definitions.
71
+
72
+ a. Adapted Material means material subject to Copyright and Similar
73
+ Rights that is derived from or based upon the Licensed Material
74
+ and in which the Licensed Material is translated, altered,
75
+ arranged, transformed, or otherwise modified in a manner requiring
76
+ permission under the Copyright and Similar Rights held by the
77
+ Licensor. For purposes of this Public License, where the Licensed
78
+ Material is a musical work, performance, or sound recording,
79
+ Adapted Material is always produced where the Licensed Material is
80
+ synched in timed relation with a moving image.
81
+
82
+ b. Adapter's License means the license You apply to Your Copyright
83
+ and Similar Rights in Your contributions to Adapted Material in
84
+ accordance with the terms and conditions of this Public License.
85
+
86
+ c. Copyright and Similar Rights means copyright and/or similar rights
87
+ closely related to copyright including, without limitation,
88
+ performance, broadcast, sound recording, and Sui Generis Database
89
+ Rights, without regard to how the rights are labeled or
90
+ categorized. For purposes of this Public License, the rights
91
+ specified in Section 2(b)(1)-(2) are not Copyright and Similar
92
+ Rights.
93
+ d. Effective Technological Measures means those measures that, in the
94
+ absence of proper authority, may not be circumvented under laws
95
+ fulfilling obligations under Article 11 of the WIPO Copyright
96
+ Treaty adopted on December 20, 1996, and/or similar international
97
+ agreements.
98
+
99
+ e. Exceptions and Limitations means fair use, fair dealing, and/or
100
+ any other exception or limitation to Copyright and Similar Rights
101
+ that applies to Your use of the Licensed Material.
102
+
103
+ f. Licensed Material means the artistic or literary work, database,
104
+ or other material to which the Licensor applied this Public
105
+ License.
106
+
107
+ g. Licensed Rights means the rights granted to You subject to the
108
+ terms and conditions of this Public License, which are limited to
109
+ all Copyright and Similar Rights that apply to Your use of the
110
+ Licensed Material and that the Licensor has authority to license.
111
+
112
+ h. Licensor means the individual(s) or entity(ies) granting rights
113
+ under this Public License.
114
+
115
+ i. NonCommercial means not primarily intended for or directed towards
116
+ commercial advantage or monetary compensation. For purposes of
117
+ this Public License, the exchange of the Licensed Material for
118
+ other material subject to Copyright and Similar Rights by digital
119
+ file-sharing or similar means is NonCommercial provided there is
120
+ no payment of monetary compensation in connection with the
121
+ exchange.
122
+
123
+ j. Share means to provide material to the public by any means or
124
+ process that requires permission under the Licensed Rights, such
125
+ as reproduction, public display, public performance, distribution,
126
+ dissemination, communication, or importation, and to make material
127
+ available to the public including in ways that members of the
128
+ public may access the material from a place and at a time
129
+ individually chosen by them.
130
+
131
+ k. Sui Generis Database Rights means rights other than copyright
132
+ resulting from Directive 96/9/EC of the European Parliament and of
133
+ the Council of 11 March 1996 on the legal protection of databases,
134
+ as amended and/or succeeded, as well as other essentially
135
+ equivalent rights anywhere in the world.
136
+
137
+ l. You means the individual or entity exercising the Licensed Rights
138
+ under this Public License. Your has a corresponding meaning.
139
+
140
+ Section 2 -- Scope.
141
+
142
+ a. License grant.
143
+
144
+ 1. Subject to the terms and conditions of this Public License,
145
+ the Licensor hereby grants You a worldwide, royalty-free,
146
+ non-sublicensable, non-exclusive, irrevocable license to
147
+ exercise the Licensed Rights in the Licensed Material to:
148
+
149
+ a. reproduce and Share the Licensed Material, in whole or
150
+ in part, for NonCommercial purposes only; and
151
+
152
+ b. produce, reproduce, and Share Adapted Material for
153
+ NonCommercial purposes only.
154
+
155
+ 2. Exceptions and Limitations. For the avoidance of doubt, where
156
+ Exceptions and Limitations apply to Your use, this Public
157
+ License does not apply, and You do not need to comply with
158
+ its terms and conditions.
159
+
160
+ 3. Term. The term of this Public License is specified in Section
161
+ 6(a).
162
+
163
+ 4. Media and formats; technical modifications allowed. The
164
+ Licensor authorizes You to exercise the Licensed Rights in
165
+ all media and formats whether now known or hereafter created,
166
+ and to make technical modifications necessary to do so. The
167
+ Licensor waives and/or agrees not to assert any right or
168
+ authority to forbid You from making technical modifications
169
+ necessary to exercise the Licensed Rights, including
170
+ technical modifications necessary to circumvent Effective
171
+ Technological Measures. For purposes of this Public License,
172
+ simply making modifications authorized by this Section 2(a)
173
+ (4) never produces Adapted Material.
174
+
175
+ 5. Downstream recipients.
176
+
177
+ a. Offer from the Licensor -- Licensed Material. Every
178
+ recipient of the Licensed Material automatically
179
+ receives an offer from the Licensor to exercise the
180
+ Licensed Rights under the terms and conditions of this
181
+ Public License.
182
+
183
+ b. No downstream restrictions. You may not offer or impose
184
+ any additional or different terms or conditions on, or
185
+ apply any Effective Technological Measures to, the
186
+ Licensed Material if doing so restricts exercise of the
187
+ Licensed Rights by any recipient of the Licensed
188
+ Material.
189
+
190
+ 6. No endorsement. Nothing in this Public License constitutes or
191
+ may be construed as permission to assert or imply that You
192
+ are, or that Your use of the Licensed Material is, connected
193
+ with, or sponsored, endorsed, or granted official status by,
194
+ the Licensor or others designated to receive attribution as
195
+ provided in Section 3(a)(1)(A)(i).
196
+
197
+ b. Other rights.
198
+
199
+ 1. Moral rights, such as the right of integrity, are not
200
+ licensed under this Public License, nor are publicity,
201
+ privacy, and/or other similar personality rights; however, to
202
+ the extent possible, the Licensor waives and/or agrees not to
203
+ assert any such rights held by the Licensor to the limited
204
+ extent necessary to allow You to exercise the Licensed
205
+ Rights, but not otherwise.
206
+
207
+ 2. Patent and trademark rights are not licensed under this
208
+ Public License.
209
+
210
+ 3. To the extent possible, the Licensor waives any right to
211
+ collect royalties from You for the exercise of the Licensed
212
+ Rights, whether directly or through a collecting society
213
+ under any voluntary or waivable statutory or compulsory
214
+ licensing scheme. In all other cases the Licensor expressly
215
+ reserves any right to collect such royalties, including when
216
+ the Licensed Material is used other than for NonCommercial
217
+ purposes.
218
+
219
+ Section 3 -- License Conditions.
220
+
221
+ Your exercise of the Licensed Rights is expressly made subject to the
222
+ following conditions.
223
+
224
+ a. Attribution.
225
+
226
+ 1. If You Share the Licensed Material (including in modified
227
+ form), You must:
228
+
229
+ a. retain the following if it is supplied by the Licensor
230
+ with the Licensed Material:
231
+
232
+ i. identification of the creator(s) of the Licensed
233
+ Material and any others designated to receive
234
+ attribution, in any reasonable manner requested by
235
+ the Licensor (including by pseudonym if
236
+ designated);
237
+
238
+ ii. a copyright notice;
239
+
240
+ iii. a notice that refers to this Public License;
241
+
242
+ iv. a notice that refers to the disclaimer of
243
+ warranties;
244
+
245
+ v. a URI or hyperlink to the Licensed Material to the
246
+ extent reasonably practicable;
247
+
248
+ b. indicate if You modified the Licensed Material and
249
+ retain an indication of any previous modifications; and
250
+
251
+ c. indicate the Licensed Material is licensed under this
252
+ Public License, and include the text of, or the URI or
253
+ hyperlink to, this Public License.
254
+
255
+ 2. You may satisfy the conditions in Section 3(a)(1) in any
256
+ reasonable manner based on the medium, means, and context in
257
+ which You Share the Licensed Material. For example, it may be
258
+ reasonable to satisfy the conditions by providing a URI or
259
+ hyperlink to a resource that includes the required
260
+ information.
261
+
262
+ 3. If requested by the Licensor, You must remove any of the
263
+ information required by Section 3(a)(1)(A) to the extent
264
+ reasonably practicable.
265
+
266
+ 4. If You Share Adapted Material You produce, the Adapter's
267
+ License You apply must not prevent recipients of the Adapted
268
+ Material from complying with this Public License.
269
+
270
+ Section 4 -- Sui Generis Database Rights.
271
+
272
+ Where the Licensed Rights include Sui Generis Database Rights that
273
+ apply to Your use of the Licensed Material:
274
+
275
+ a. for the avoidance of doubt, Section 2(a)(1) grants You the right
276
+ to extract, reuse, reproduce, and Share all or a substantial
277
+ portion of the contents of the database for NonCommercial purposes
278
+ only;
279
+
280
+ b. if You include all or a substantial portion of the database
281
+ contents in a database in which You have Sui Generis Database
282
+ Rights, then the database in which You have Sui Generis Database
283
+ Rights (but not its individual contents) is Adapted Material; and
284
+
285
+ c. You must comply with the conditions in Section 3(a) if You Share
286
+ all or a substantial portion of the contents of the database.
287
+
288
+ For the avoidance of doubt, this Section 4 supplements and does not
289
+ replace Your obligations under this Public License where the Licensed
290
+ Rights include other Copyright and Similar Rights.
291
+
292
+ Section 5 -- Disclaimer of Warranties and Limitation of Liability.
293
+
294
+ a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
295
+ EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
296
+ AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
297
+ ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
298
+ IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
299
+ WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
300
+ PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
301
+ ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
302
+ KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
303
+ ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
304
+
305
+ b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
306
+ TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
307
+ NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
308
+ INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
309
+ COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
310
+ USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
311
+ ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
312
+ DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
313
+ IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
314
+
315
+ c. The disclaimer of warranties and limitation of liability provided
316
+ above shall be interpreted in a manner that, to the extent
317
+ possible, most closely approximates an absolute disclaimer and
318
+ waiver of all liability.
319
+
320
+ Section 6 -- Term and Termination.
321
+
322
+ a. This Public License applies for the term of the Copyright and
323
+ Similar Rights licensed here. However, if You fail to comply with
324
+ this Public License, then Your rights under this Public License
325
+ terminate automatically.
326
+
327
+ b. Where Your right to use the Licensed Material has terminated under
328
+ Section 6(a), it reinstates:
329
+
330
+ 1. automatically as of the date the violation is cured, provided
331
+ it is cured within 30 days of Your discovery of the
332
+ violation; or
333
+
334
+ 2. upon express reinstatement by the Licensor.
335
+
336
+ For the avoidance of doubt, this Section 6(b) does not affect any
337
+ right the Licensor may have to seek remedies for Your violations
338
+ of this Public License.
339
+
340
+ c. For the avoidance of doubt, the Licensor may also offer the
341
+ Licensed Material under separate terms or conditions or stop
342
+ distributing the Licensed Material at any time; however, doing so
343
+ will not terminate this Public License.
344
+
345
+ d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
346
+ License.
347
+
348
+ Section 7 -- Other Terms and Conditions.
349
+
350
+ a. The Licensor shall not be bound by any additional or different
351
+ terms or conditions communicated by You unless expressly agreed.
352
+
353
+ b. Any arrangements, understandings, or agreements regarding the
354
+ Licensed Material not stated herein are separate from and
355
+ independent of the terms and conditions of this Public License.
356
+
357
+ Section 8 -- Interpretation.
358
+
359
+ a. For the avoidance of doubt, this Public License does not, and
360
+ shall not be interpreted to, reduce, limit, restrict, or impose
361
+ conditions on any use of the Licensed Material that could lawfully
362
+ be made without permission under this Public License.
363
+
364
+ b. To the extent possible, if any provision of this Public License is
365
+ deemed unenforceable, it shall be automatically reformed to the
366
+ minimum extent necessary to make it enforceable. If the provision
367
+ cannot be reformed, it shall be severed from this Public License
368
+ without affecting the enforceability of the remaining terms and
369
+ conditions.
370
+
371
+ c. No term or condition of this Public License will be waived and no
372
+ failure to comply consented to unless expressly agreed to by the
373
+ Licensor.
374
+
375
+ d. Nothing in this Public License constitutes or may be interpreted
376
+ as a limitation upon, or waiver of, any privileges and immunities
377
+ that apply to the Licensor or You, including from the legal
378
+ processes of any jurisdiction or authority.
379
+
380
+ =======================================================================
381
+
382
+ Creative Commons is not a party to its public
383
+ licenses. Notwithstanding, Creative Commons may elect to apply one of
384
+ its public licenses to material it publishes and in those instances
385
+ will be considered the “Licensor.” The text of the Creative Commons
386
+ public licenses is dedicated to the public domain under the CC0 Public
387
+ Domain Dedication. Except for the limited purpose of indicating that
388
+ material is shared under a Creative Commons public license or as
389
+ otherwise permitted by the Creative Commons policies published at
390
+ creativecommons.org/policies, Creative Commons does not authorize the
391
+ use of the trademark "Creative Commons" or any other trademark or logo
392
+ of Creative Commons without its prior written consent including,
393
+ without limitation, in connection with any unauthorized modifications
394
+ to any of its public licenses or any other arrangements,
395
+ understandings, or agreements concerning use of licensed material. For
396
+ the avoidance of doubt, this paragraph does not form part of the
397
+ public licenses.
398
+
399
+ Creative Commons may be contacted at creativecommons.org.
app.py ADDED
@@ -0,0 +1,405 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import sys, os
2
+ import torch
3
+ TORCH_VERSION = ".".join(torch.__version__.split(".")[:2])
4
+ CUDA_VERSION = torch.__version__.split("+")[-1]
5
+ print("torch: ", TORCH_VERSION, "; cuda: ", CUDA_VERSION)
6
+ # Install detectron2 that matches the above pytorch version
7
+ # See https://detectron2.readthedocs.io/tutorials/install.html for instructions
8
+ os.system(f'pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/{CUDA_VERSION}/torch{TORCH_VERSION}/index.html')
9
+ os.system("pip install git+https://github.com/cocodataset/panopticapi.git")
10
+
11
+ # Imports
12
+ import gradio as gr
13
+ import detectron2
14
+ from detectron2.utils.logger import setup_logger
15
+ import numpy as np
16
+ import cv2
17
+ import torch
18
+ import torch.nn.functional as F
19
+ import torchvision.transforms.functional as TF
20
+ from torchvision import datasets, transforms
21
+ from einops import rearrange
22
+ from PIL import Image
23
+ import imutils
24
+ import matplotlib.pyplot as plt
25
+ from mpl_toolkits.axes_grid1 import ImageGrid
26
+ from tqdm import tqdm
27
+ import random
28
+ from functools import partial
29
+
30
+ # import some common detectron2 utilities
31
+ from detectron2 import model_zoo
32
+ from detectron2.engine import DefaultPredictor
33
+ from detectron2.config import get_cfg
34
+ from detectron2.utils.visualizer import Visualizer, ColorMode
35
+ from detectron2.data import MetadataCatalog
36
+ from detectron2.projects.deeplab import add_deeplab_config
37
+ coco_metadata = MetadataCatalog.get("coco_2017_val_panoptic")
38
+
39
+ # Import Mask2Former
40
+ from mask2former import add_maskformer2_config
41
+
42
+ # DPT dependencies for depth pseudo labeling
43
+ from dpt.models import DPTDepthModel
44
+
45
+ from multimae.input_adapters import PatchedInputAdapter, SemSegInputAdapter
46
+ from multimae.output_adapters import SpatialOutputAdapter
47
+ from multimae.multimae import pretrain_multimae_base
48
+ from utils.data_constants import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
49
+
50
+ torch.set_grad_enabled(False)
51
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
52
+ print(f'device: {device}')
53
+
54
+
55
+ # Initialize COCO Mask2Former
56
+ cfg = get_cfg()
57
+ cfg.MODEL.DEVICE='cpu'
58
+ add_deeplab_config(cfg)
59
+ add_maskformer2_config(cfg)
60
+ cfg.merge_from_file("mask2former/configs/coco/panoptic-segmentation/swin/maskformer2_swin_small_bs16_50ep.yaml")
61
+ cfg.MODEL.WEIGHTS = 'https://dl.fbaipublicfiles.com/maskformer/mask2former/coco/panoptic/maskformer2_swin_small_bs16_50ep/model_final_a407fd.pkl'
62
+ cfg.MODEL.MASK_FORMER.TEST.SEMANTIC_ON = True
63
+ cfg.MODEL.MASK_FORMER.TEST.INSTANCE_ON = True
64
+ cfg.MODEL.MASK_FORMER.TEST.PANOPTIC_ON = True
65
+ semseg_model = DefaultPredictor(cfg)
66
+
67
+ def predict_semseg(img):
68
+ return semseg_model(255*img.permute(1,2,0).numpy())['sem_seg'].argmax(0)
69
+
70
+ def plot_semseg(img, semseg, ax):
71
+ v = Visualizer(img.permute(1,2,0), coco_metadata, scale=1.2, instance_mode=ColorMode.IMAGE_BW)
72
+ semantic_result = v.draw_sem_seg(semseg.cpu()).get_image()
73
+ ax.imshow(semantic_result)
74
+
75
+
76
+ # Initialize Omnidata depth model
77
+ os.system("wget https://drive.switch.ch/index.php/s/RFfTZwyKROKKx0l/download")
78
+ os.system("unzip -j download -d pretrained_models")
79
+ os.system("rm download")
80
+
81
+ omnidata_ckpt = torch.load('./pretrained_models/omnidata_rgb2depth_dpt_hybrid.pth', map_location='cpu')
82
+ depth_model = DPTDepthModel()
83
+ depth_model.load_state_dict(omnidata_ckpt)
84
+ depth_model = depth_model.to(device).eval()
85
+
86
+ def predict_depth(img):
87
+ depth_model_input = (img.unsqueeze(0) - 0.5) / 0.5
88
+ return depth_model(depth_model_input.to(device))
89
+
90
+
91
+ # MultiMAE model setup
92
+ DOMAIN_CONF = {
93
+ 'rgb': {
94
+ 'input_adapter': partial(PatchedInputAdapter, num_channels=3, stride_level=1),
95
+ 'output_adapter': partial(SpatialOutputAdapter, num_channels=3, stride_level=1),
96
+ },
97
+ 'depth': {
98
+ 'input_adapter': partial(PatchedInputAdapter, num_channels=1, stride_level=1),
99
+ 'output_adapter': partial(SpatialOutputAdapter, num_channels=1, stride_level=1),
100
+ },
101
+ 'semseg': {
102
+ 'input_adapter': partial(SemSegInputAdapter, num_classes=133,
103
+ dim_class_emb=64, interpolate_class_emb=False, stride_level=4),
104
+ 'output_adapter': partial(SpatialOutputAdapter, num_channels=133, stride_level=4),
105
+ },
106
+ }
107
+ DOMAINS = ['rgb', 'depth', 'semseg']
108
+
109
+ input_adapters = {
110
+ domain: dinfo['input_adapter'](
111
+ patch_size_full=16,
112
+ )
113
+ for domain, dinfo in DOMAIN_CONF.items()
114
+ }
115
+ output_adapters = {
116
+ domain: dinfo['output_adapter'](
117
+ patch_size_full=16,
118
+ dim_tokens=256,
119
+ use_task_queries=True,
120
+ depth=2,
121
+ context_tasks=DOMAINS,
122
+ task=domain
123
+ )
124
+ for domain, dinfo in DOMAIN_CONF.items()
125
+ }
126
+
127
+ multimae = pretrain_multimae_base(
128
+ input_adapters=input_adapters,
129
+ output_adapters=output_adapters,
130
+ )
131
+
132
+ CKPT_URL = 'https://github.com/EPFL-VILAB/MultiMAE/releases/download/pretrained-weights/multimae-b_98_rgb+-depth-semseg_1600e_multivit-afff3f8c.pth'
133
+ ckpt = torch.hub.load_state_dict_from_url(CKPT_URL, map_location='cpu')
134
+ multimae.load_state_dict(ckpt['model'], strict=False)
135
+ multimae = multimae.to(device).eval()
136
+
137
+
138
+ # Plotting
139
+
140
+ def get_masked_image(img, mask, image_size=224, patch_size=16, mask_value=0.0):
141
+ img_token = rearrange(
142
+ img.detach().cpu(),
143
+ 'b c (nh ph) (nw pw) -> b (nh nw) (c ph pw)',
144
+ ph=patch_size, pw=patch_size, nh=image_size//patch_size, nw=image_size//patch_size
145
+ )
146
+ img_token[mask.detach().cpu()!=0] = mask_value
147
+ img = rearrange(
148
+ img_token,
149
+ 'b (nh nw) (c ph pw) -> b c (nh ph) (nw pw)',
150
+ ph=patch_size, pw=patch_size, nh=image_size//patch_size, nw=image_size//patch_size
151
+ )
152
+ return img
153
+
154
+
155
+ def denormalize(img, mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD):
156
+ return TF.normalize(
157
+ img.clone(),
158
+ mean= [-m/s for m, s in zip(mean, std)],
159
+ std= [1/s for s in std]
160
+ )
161
+
162
+ def plot_semseg_gt(input_dict, ax=None, image_size=224):
163
+ metadata = MetadataCatalog.get("coco_2017_val_panoptic")
164
+ instance_mode = ColorMode.IMAGE
165
+ img_viz = 255 * denormalize(input_dict['rgb'].detach().cpu())[0].permute(1,2,0)
166
+ semseg = F.interpolate(
167
+ input_dict['semseg'].unsqueeze(0).cpu().float(), size=image_size, mode='nearest'
168
+ ).long()[0,0]
169
+ visualizer = Visualizer(img_viz, metadata, instance_mode=instance_mode, scale=1)
170
+ visualizer.draw_sem_seg(semseg)
171
+ if ax is not None:
172
+ ax.imshow(visualizer.get_output().get_image())
173
+ else:
174
+ return visualizer.get_output().get_image()
175
+
176
+
177
+ def plot_semseg_gt_masked(input_dict, mask, ax=None, mask_value=1.0, image_size=224):
178
+ img = plot_semseg_gt(input_dict, image_size=image_size)
179
+ img = torch.LongTensor(img).permute(2,0,1).unsqueeze(0)
180
+ masked_img = get_masked_image(img.float()/255.0, mask, image_size=image_size, patch_size=16, mask_value=mask_value)
181
+ masked_img = masked_img[0].permute(1,2,0)
182
+
183
+ if ax is not None:
184
+ ax.imshow(masked_img)
185
+ else:
186
+ return masked_img
187
+
188
+
189
+ def get_pred_with_input(gt, pred, mask, image_size=224, patch_size=16):
190
+ gt_token = rearrange(
191
+ gt.detach().cpu(),
192
+ 'b c (nh ph) (nw pw) -> b (nh nw) (c ph pw)',
193
+ ph=patch_size, pw=patch_size, nh=image_size//patch_size, nw=image_size//patch_size
194
+ )
195
+ pred_token = rearrange(
196
+ pred.detach().cpu(),
197
+ 'b c (nh ph) (nw pw) -> b (nh nw) (c ph pw)',
198
+ ph=patch_size, pw=patch_size, nh=image_size//patch_size, nw=image_size//patch_size
199
+ )
200
+ pred_token[mask.detach().cpu()==0] = gt_token[mask.detach().cpu()==0]
201
+ img = rearrange(
202
+ pred_token,
203
+ 'b (nh nw) (c ph pw) -> b c (nh ph) (nw pw)',
204
+ ph=patch_size, pw=patch_size, nh=image_size//patch_size, nw=image_size//patch_size
205
+ )
206
+ return img
207
+
208
+
209
+ def plot_semseg_pred_masked(rgb, semseg_preds, semseg_gt, mask, ax=None, image_size=224):
210
+ metadata = MetadataCatalog.get("coco_2017_val_panoptic")
211
+ instance_mode = ColorMode.IMAGE
212
+ img_viz = 255 * denormalize(rgb.detach().cpu())[0].permute(1,2,0)
213
+
214
+ semseg = get_pred_with_input(
215
+ semseg_gt.unsqueeze(1),
216
+ semseg_preds.argmax(1).unsqueeze(1),
217
+ mask,
218
+ image_size=image_size//4,
219
+ patch_size=4
220
+ )
221
+
222
+ semseg = F.interpolate(semseg.float(), size=image_size, mode='nearest')[0,0].long()
223
+
224
+ visualizer = Visualizer(img_viz, metadata, instance_mode=instance_mode, scale=1)
225
+ visualizer.draw_sem_seg(semseg)
226
+ if ax is not None:
227
+ ax.imshow(visualizer.get_output().get_image())
228
+ else:
229
+ return visualizer.get_output().get_image()
230
+
231
+ def plot_predictions(input_dict, preds, masks, image_size=224):
232
+
233
+ masked_rgb = get_masked_image(
234
+ denormalize(input_dict['rgb']),
235
+ masks['rgb'],
236
+ image_size=image_size,
237
+ mask_value=1.0
238
+ )[0].permute(1,2,0).detach().cpu()
239
+ masked_depth = get_masked_image(
240
+ input_dict['depth'],
241
+ masks['depth'],
242
+ image_size=image_size,
243
+ mask_value=np.nan
244
+ )[0,0].detach().cpu()
245
+
246
+ pred_rgb = denormalize(preds['rgb'])[0].permute(1,2,0).clamp(0,1)
247
+ pred_depth = preds['depth'][0,0].detach().cpu()
248
+
249
+ pred_rgb2 = get_pred_with_input(
250
+ denormalize(input_dict['rgb']),
251
+ denormalize(preds['rgb']).clamp(0,1),
252
+ masks['rgb'],
253
+ image_size=image_size
254
+ )[0].permute(1,2,0).detach().cpu()
255
+ pred_depth2 = get_pred_with_input(
256
+ input_dict['depth'],
257
+ preds['depth'],
258
+ masks['depth'],
259
+ image_size=image_size
260
+ )[0,0].detach().cpu()
261
+
262
+ fig = plt.figure(figsize=(10, 10))
263
+ grid = ImageGrid(fig, 111, nrows_ncols=(3, 3), axes_pad=0)
264
+
265
+ grid[0].imshow(masked_rgb)
266
+ grid[1].imshow(pred_rgb2)
267
+ grid[2].imshow(denormalize(input_dict['rgb'])[0].permute(1,2,0).detach().cpu())
268
+
269
+ grid[3].imshow(masked_depth)
270
+ grid[4].imshow(pred_depth2)
271
+ grid[5].imshow(input_dict['depth'][0,0].detach().cpu())
272
+
273
+ plot_semseg_gt_masked(input_dict, masks['semseg'], grid[6], mask_value=1.0, image_size=image_size)
274
+ plot_semseg_pred_masked(input_dict['rgb'], preds['semseg'], input_dict['semseg'], masks['semseg'], grid[7], image_size=image_size)
275
+ plot_semseg_gt(input_dict, grid[8], image_size=image_size)
276
+
277
+ for ax in grid:
278
+ ax.set_xticks([])
279
+ ax.set_yticks([])
280
+
281
+ fontsize = 16
282
+ grid[0].set_title('Masked inputs', fontsize=fontsize)
283
+ grid[1].set_title('MultiMAE predictions', fontsize=fontsize)
284
+ grid[2].set_title('Original Reference', fontsize=fontsize)
285
+ grid[0].set_ylabel('RGB', fontsize=fontsize)
286
+ grid[3].set_ylabel('Depth', fontsize=fontsize)
287
+ grid[6].set_ylabel('Semantic', fontsize=fontsize)
288
+
289
+ plt.savefig('./output.png', dpi=300, bbox_inches='tight')
290
+ plt.close()
291
+
292
+
293
+ def inference(img, num_rgb, num_depth, num_semseg, seed, perform_sampling, alphas, num_tokens):
294
+ im = Image.open(img)
295
+
296
+ # Center crop and resize RGB
297
+ image_size = 224 # Train resolution
298
+ img = TF.center_crop(TF.to_tensor(im), min(im.size))
299
+ img = TF.resize(img, image_size)
300
+
301
+ # Predict depth and semseg
302
+ depth = predict_depth(img)
303
+ semseg = predict_semseg(img)
304
+
305
+
306
+ # Pre-process RGB, depth and semseg to the MultiMAE input format
307
+ input_dict = {}
308
+
309
+ # Normalize RGB
310
+ input_dict['rgb'] = TF.normalize(img, mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD).unsqueeze(0)
311
+
312
+ # Normalize depth robustly
313
+ trunc_depth = torch.sort(depth.flatten())[0]
314
+ trunc_depth = trunc_depth[int(0.1 * trunc_depth.shape[0]): int(0.9 * trunc_depth.shape[0])]
315
+ depth = (depth - trunc_depth.mean()[None,None,None]) / torch.sqrt(trunc_depth.var()[None,None,None] + 1e-6)
316
+ input_dict['depth'] = depth.unsqueeze(0)
317
+
318
+ # Downsample semantic segmentation
319
+ stride = 4
320
+ semseg = TF.resize(semseg.unsqueeze(0), (semseg.shape[0] // stride, semseg.shape[1] // stride), interpolation=TF.InterpolationMode.NEAREST)
321
+ input_dict['semseg'] = semseg
322
+
323
+ # To GPU
324
+ input_dict = {k: v.to(device) for k,v in input_dict.items()}
325
+
326
+
327
+ torch.manual_seed(int(seed)) # change seed to resample new mask
328
+
329
+ if perform_sampling:
330
+ # Randomly sample masks
331
+
332
+ alphas = min(10000.0, max(0.00001, float(alphas))) # Clamp alphas to reasonable range
333
+
334
+ preds, masks = multimae.forward(
335
+ input_dict,
336
+ mask_inputs=True, # True if forward pass should sample random masks
337
+ num_encoded_tokens=num_tokens,
338
+ alphas=alphas
339
+ )
340
+ else:
341
+ # Randomly sample masks using the specified number of tokens per modality
342
+ task_masks = {domain: torch.ones(1,196).long().to(device) for domain in DOMAINS}
343
+ selected_rgb_idxs = torch.randperm(196)[:num_rgb]
344
+ selected_depth_idxs = torch.randperm(196)[:num_depth]
345
+ selected_semseg_idxs = torch.randperm(196)[:num_semseg]
346
+ task_masks['rgb'][:,selected_rgb_idxs] = 0
347
+ task_masks['depth'][:,selected_depth_idxs] = 0
348
+ task_masks['semseg'][:,selected_semseg_idxs] = 0
349
+
350
+ preds, masks = multimae.forward(
351
+ input_dict,
352
+ mask_inputs=True,
353
+ task_masks=task_masks
354
+ )
355
+
356
+ preds = {domain: pred.detach().cpu() for domain, pred in preds.items()}
357
+ masks = {domain: mask.detach().cpu() for domain, mask in masks.items()}
358
+
359
+ plot_predictions(input_dict, preds, masks)
360
+
361
+ return 'output.png'
362
+
363
+
364
+ title = "MultiMAE"
365
+ description = "Gradio demo for MultiMAE: Multi-modal Multi-task Masked Autoencoders. \
366
+ Upload your own images or try one of the examples below to explore the multi-modal masked reconstruction of a pre-trained MultiMAE model. \
367
+ Uploaded images are pseudo labeled using a DPT trained on Omnidata depth, and a Mask2Former trained on COCO. \
368
+ Choose the number of visible tokens using the sliders below (or sample them randomly) and see how MultiMAE reconstructs the modalities!"
369
+
370
+ article = "<p style='text-align: center'><a href='https://arxiv.org/abs/2204.01678' \
371
+ target='_blank'>MultiMAE: Multi-modal Multi-task Masked Autoencoders</a> | \
372
+ <a href='https://github.com/EPFL-VILAB/MultiMAE' target='_blank'>Github Repo</a></p>"
373
+
374
+ css = '.output-image{height: 713px !important}'
375
+
376
+ # Example images
377
+ os.system("wget https://i.imgur.com/c9ObJdK.jpg")
378
+ examples = [['c9ObJdK.jpg', 32, 32, 32, 0, True, 1.0, 98]]
379
+
380
+ gr.Interface(
381
+ fn=inference,
382
+ inputs=[
383
+ gr.inputs.Image(label='RGB input image', type='filepath'),
384
+ gr.inputs.Slider(label='Number of RGB input tokens', default=32, step=1, minimum=0, maximum=196),
385
+ gr.inputs.Slider(label='Number of depth input tokens', default=32, step=1, minimum=0, maximum=196),
386
+ gr.inputs.Slider(label='Number of semantic input tokens', default=32, step=1, minimum=0, maximum=196),
387
+ gr.inputs.Number(label='Random seed: Change this to sample different masks', default=0),
388
+ gr.inputs.Checkbox(label='Randomize the number of tokens: Check this to ignore the above sliders and randomly sample the number \
389
+ of tokens per modality using the parameters below', default=False),
390
+ gr.inputs.Slider(label='Symmetric Dirichlet concentration parameter (α > 0). Low values (α << 1.0) result in a sampling behavior, \
391
+ where most of the time, all visible tokens will be sampled from a single modality. High values \
392
+ (α >> 1.0) result in similar numbers of tokens being sampled for each modality. α = 1.0 is equivalent \
393
+ to uniform sampling over the simplex and contains both previous cases and everything in between.',
394
+ default=1.0, step=0.1, minimum=0.1, maximum=5.0),
395
+ gr.inputs.Slider(label='Number of input tokens', default=98, step=1, minimum=0, maximum=588),
396
+ ],
397
+ outputs=[
398
+ gr.outputs.Image(label='MultiMAE predictions', type='file')
399
+ ],
400
+ css=css,
401
+ title=title,
402
+ description=description,
403
+ article=article,
404
+ examples=examples
405
+ ).launch(enable_queue=True, cache_examples=True)
dpt/__init__.py ADDED
File without changes
dpt/base_model.py ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+
3
+
4
+ class BaseModel(torch.nn.Module):
5
+ def load(self, path):
6
+ """Load model from file.
7
+
8
+ Args:
9
+ path (str): file path
10
+ """
11
+ parameters = torch.load(path, map_location=torch.device("cpu"))
12
+
13
+ if "optimizer" in parameters:
14
+ parameters = parameters["model"]
15
+
16
+ self.load_state_dict(parameters)
dpt/blocks.py ADDED
@@ -0,0 +1,383 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+ import torch.nn as nn
3
+
4
+ from .vit import (
5
+ _make_pretrained_vitb_rn50_384,
6
+ _make_pretrained_vitl16_384,
7
+ _make_pretrained_vitb16_384,
8
+ forward_vit,
9
+ )
10
+
11
+
12
+ def _make_encoder(
13
+ backbone,
14
+ features,
15
+ use_pretrained,
16
+ groups=1,
17
+ expand=False,
18
+ exportable=True,
19
+ hooks=None,
20
+ use_vit_only=False,
21
+ use_readout="ignore",
22
+ enable_attention_hooks=False,
23
+ ):
24
+ if backbone == "vitl16_384":
25
+ pretrained = _make_pretrained_vitl16_384(
26
+ use_pretrained,
27
+ hooks=hooks,
28
+ use_readout=use_readout,
29
+ enable_attention_hooks=enable_attention_hooks,
30
+ )
31
+ scratch = _make_scratch(
32
+ [256, 512, 1024, 1024], features, groups=groups, expand=expand
33
+ ) # ViT-L/16 - 85.0% Top1 (backbone)
34
+ elif backbone == "vitb_rn50_384":
35
+ pretrained = _make_pretrained_vitb_rn50_384(
36
+ use_pretrained,
37
+ hooks=hooks,
38
+ use_vit_only=use_vit_only,
39
+ use_readout=use_readout,
40
+ enable_attention_hooks=enable_attention_hooks,
41
+ )
42
+ scratch = _make_scratch(
43
+ [256, 512, 768, 768], features, groups=groups, expand=expand
44
+ ) # ViT-H/16 - 85.0% Top1 (backbone)
45
+ elif backbone == "vitb16_384":
46
+ pretrained = _make_pretrained_vitb16_384(
47
+ use_pretrained,
48
+ hooks=hooks,
49
+ use_readout=use_readout,
50
+ enable_attention_hooks=enable_attention_hooks,
51
+ )
52
+ scratch = _make_scratch(
53
+ [96, 192, 384, 768], features, groups=groups, expand=expand
54
+ ) # ViT-B/16 - 84.6% Top1 (backbone)
55
+ elif backbone == "resnext101_wsl":
56
+ pretrained = _make_pretrained_resnext101_wsl(use_pretrained)
57
+ scratch = _make_scratch(
58
+ [256, 512, 1024, 2048], features, groups=groups, expand=expand
59
+ ) # efficientnet_lite3
60
+ else:
61
+ print(f"Backbone '{backbone}' not implemented")
62
+ assert False
63
+
64
+ return pretrained, scratch
65
+
66
+
67
+ def _make_scratch(in_shape, out_shape, groups=1, expand=False):
68
+ scratch = nn.Module()
69
+
70
+ out_shape1 = out_shape
71
+ out_shape2 = out_shape
72
+ out_shape3 = out_shape
73
+ out_shape4 = out_shape
74
+ if expand == True:
75
+ out_shape1 = out_shape
76
+ out_shape2 = out_shape * 2
77
+ out_shape3 = out_shape * 4
78
+ out_shape4 = out_shape * 8
79
+
80
+ scratch.layer1_rn = nn.Conv2d(
81
+ in_shape[0],
82
+ out_shape1,
83
+ kernel_size=3,
84
+ stride=1,
85
+ padding=1,
86
+ bias=False,
87
+ groups=groups,
88
+ )
89
+ scratch.layer2_rn = nn.Conv2d(
90
+ in_shape[1],
91
+ out_shape2,
92
+ kernel_size=3,
93
+ stride=1,
94
+ padding=1,
95
+ bias=False,
96
+ groups=groups,
97
+ )
98
+ scratch.layer3_rn = nn.Conv2d(
99
+ in_shape[2],
100
+ out_shape3,
101
+ kernel_size=3,
102
+ stride=1,
103
+ padding=1,
104
+ bias=False,
105
+ groups=groups,
106
+ )
107
+ scratch.layer4_rn = nn.Conv2d(
108
+ in_shape[3],
109
+ out_shape4,
110
+ kernel_size=3,
111
+ stride=1,
112
+ padding=1,
113
+ bias=False,
114
+ groups=groups,
115
+ )
116
+
117
+ return scratch
118
+
119
+
120
+ def _make_resnet_backbone(resnet):
121
+ pretrained = nn.Module()
122
+ pretrained.layer1 = nn.Sequential(
123
+ resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool, resnet.layer1
124
+ )
125
+
126
+ pretrained.layer2 = resnet.layer2
127
+ pretrained.layer3 = resnet.layer3
128
+ pretrained.layer4 = resnet.layer4
129
+
130
+ return pretrained
131
+
132
+
133
+ def _make_pretrained_resnext101_wsl(use_pretrained):
134
+ resnet = torch.hub.load("facebookresearch/WSL-Images", "resnext101_32x8d_wsl")
135
+ return _make_resnet_backbone(resnet)
136
+
137
+
138
+ class Interpolate(nn.Module):
139
+ """Interpolation module."""
140
+
141
+ def __init__(self, scale_factor, mode, align_corners=False):
142
+ """Init.
143
+
144
+ Args:
145
+ scale_factor (float): scaling
146
+ mode (str): interpolation mode
147
+ """
148
+ super(Interpolate, self).__init__()
149
+
150
+ self.interp = nn.functional.interpolate
151
+ self.scale_factor = scale_factor
152
+ self.mode = mode
153
+ self.align_corners = align_corners
154
+
155
+ def forward(self, x):
156
+ """Forward pass.
157
+
158
+ Args:
159
+ x (tensor): input
160
+
161
+ Returns:
162
+ tensor: interpolated data
163
+ """
164
+
165
+ x = self.interp(
166
+ x,
167
+ scale_factor=self.scale_factor,
168
+ mode=self.mode,
169
+ align_corners=self.align_corners,
170
+ )
171
+
172
+ return x
173
+
174
+
175
+ class ResidualConvUnit(nn.Module):
176
+ """Residual convolution module."""
177
+
178
+ def __init__(self, features):
179
+ """Init.
180
+
181
+ Args:
182
+ features (int): number of features
183
+ """
184
+ super().__init__()
185
+
186
+ self.conv1 = nn.Conv2d(
187
+ features, features, kernel_size=3, stride=1, padding=1, bias=True
188
+ )
189
+
190
+ self.conv2 = nn.Conv2d(
191
+ features, features, kernel_size=3, stride=1, padding=1, bias=True
192
+ )
193
+
194
+ self.relu = nn.ReLU(inplace=True)
195
+
196
+ def forward(self, x):
197
+ """Forward pass.
198
+
199
+ Args:
200
+ x (tensor): input
201
+
202
+ Returns:
203
+ tensor: output
204
+ """
205
+ out = self.relu(x)
206
+ out = self.conv1(out)
207
+ out = self.relu(out)
208
+ out = self.conv2(out)
209
+
210
+ return out + x
211
+
212
+
213
+ class FeatureFusionBlock(nn.Module):
214
+ """Feature fusion block."""
215
+
216
+ def __init__(self, features):
217
+ """Init.
218
+
219
+ Args:
220
+ features (int): number of features
221
+ """
222
+ super(FeatureFusionBlock, self).__init__()
223
+
224
+ self.resConfUnit1 = ResidualConvUnit(features)
225
+ self.resConfUnit2 = ResidualConvUnit(features)
226
+
227
+ def forward(self, *xs):
228
+ """Forward pass.
229
+
230
+ Returns:
231
+ tensor: output
232
+ """
233
+ output = xs[0]
234
+
235
+ if len(xs) == 2:
236
+ output += self.resConfUnit1(xs[1])
237
+
238
+ output = self.resConfUnit2(output)
239
+
240
+ output = nn.functional.interpolate(
241
+ output, scale_factor=2, mode="bilinear", align_corners=True
242
+ )
243
+
244
+ return output
245
+
246
+
247
+ class ResidualConvUnit_custom(nn.Module):
248
+ """Residual convolution module."""
249
+
250
+ def __init__(self, features, activation, bn):
251
+ """Init.
252
+
253
+ Args:
254
+ features (int): number of features
255
+ """
256
+ super().__init__()
257
+
258
+ self.bn = bn
259
+
260
+ self.groups = 1
261
+
262
+ self.conv1 = nn.Conv2d(
263
+ features,
264
+ features,
265
+ kernel_size=3,
266
+ stride=1,
267
+ padding=1,
268
+ bias=not self.bn,
269
+ groups=self.groups,
270
+ )
271
+
272
+ self.conv2 = nn.Conv2d(
273
+ features,
274
+ features,
275
+ kernel_size=3,
276
+ stride=1,
277
+ padding=1,
278
+ bias=not self.bn,
279
+ groups=self.groups,
280
+ )
281
+
282
+ if self.bn == True:
283
+ self.bn1 = nn.BatchNorm2d(features)
284
+ self.bn2 = nn.BatchNorm2d(features)
285
+
286
+ self.activation = activation
287
+
288
+ self.skip_add = nn.quantized.FloatFunctional()
289
+
290
+ def forward(self, x):
291
+ """Forward pass.
292
+
293
+ Args:
294
+ x (tensor): input
295
+
296
+ Returns:
297
+ tensor: output
298
+ """
299
+
300
+ out = self.activation(x)
301
+ out = self.conv1(out)
302
+ if self.bn == True:
303
+ out = self.bn1(out)
304
+
305
+ out = self.activation(out)
306
+ out = self.conv2(out)
307
+ if self.bn == True:
308
+ out = self.bn2(out)
309
+
310
+ if self.groups > 1:
311
+ out = self.conv_merge(out)
312
+
313
+ return self.skip_add.add(out, x)
314
+
315
+ # return out + x
316
+
317
+
318
+ class FeatureFusionBlock_custom(nn.Module):
319
+ """Feature fusion block."""
320
+
321
+ def __init__(
322
+ self,
323
+ features,
324
+ activation,
325
+ deconv=False,
326
+ bn=False,
327
+ expand=False,
328
+ align_corners=True,
329
+ ):
330
+ """Init.
331
+
332
+ Args:
333
+ features (int): number of features
334
+ """
335
+ super(FeatureFusionBlock_custom, self).__init__()
336
+
337
+ self.deconv = deconv
338
+ self.align_corners = align_corners
339
+
340
+ self.groups = 1
341
+
342
+ self.expand = expand
343
+ out_features = features
344
+ if self.expand == True:
345
+ out_features = features // 2
346
+
347
+ self.out_conv = nn.Conv2d(
348
+ features,
349
+ out_features,
350
+ kernel_size=1,
351
+ stride=1,
352
+ padding=0,
353
+ bias=True,
354
+ groups=1,
355
+ )
356
+
357
+ self.resConfUnit1 = ResidualConvUnit_custom(features, activation, bn)
358
+ self.resConfUnit2 = ResidualConvUnit_custom(features, activation, bn)
359
+
360
+ self.skip_add = nn.quantized.FloatFunctional()
361
+
362
+ def forward(self, *xs):
363
+ """Forward pass.
364
+
365
+ Returns:
366
+ tensor: output
367
+ """
368
+ output = xs[0]
369
+
370
+ if len(xs) == 2:
371
+ res = self.resConfUnit1(xs[1])
372
+ output = self.skip_add.add(output, res)
373
+ # output += res
374
+
375
+ output = self.resConfUnit2(output)
376
+
377
+ output = nn.functional.interpolate(
378
+ output, scale_factor=2, mode="bilinear", align_corners=self.align_corners
379
+ )
380
+
381
+ output = self.out_conv(output)
382
+
383
+ return output
dpt/midas_net.py ADDED
@@ -0,0 +1,77 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """MidashNet: Network for monocular depth estimation trained by mixing several datasets.
2
+ This file contains code that is adapted from
3
+ https://github.com/thomasjpfan/pytorch_refinenet/blob/master/pytorch_refinenet/refinenet/refinenet_4cascade.py
4
+ """
5
+ import torch
6
+ import torch.nn as nn
7
+
8
+ from .base_model import BaseModel
9
+ from .blocks import FeatureFusionBlock, Interpolate, _make_encoder
10
+
11
+
12
+ class MidasNet_large(BaseModel):
13
+ """Network for monocular depth estimation."""
14
+
15
+ def __init__(self, path=None, features=256, non_negative=True):
16
+ """Init.
17
+
18
+ Args:
19
+ path (str, optional): Path to saved model. Defaults to None.
20
+ features (int, optional): Number of features. Defaults to 256.
21
+ backbone (str, optional): Backbone network for encoder. Defaults to resnet50
22
+ """
23
+ print("Loading weights: ", path)
24
+
25
+ super(MidasNet_large, self).__init__()
26
+
27
+ use_pretrained = False if path is None else True
28
+
29
+ self.pretrained, self.scratch = _make_encoder(
30
+ backbone="resnext101_wsl", features=features, use_pretrained=use_pretrained
31
+ )
32
+
33
+ self.scratch.refinenet4 = FeatureFusionBlock(features)
34
+ self.scratch.refinenet3 = FeatureFusionBlock(features)
35
+ self.scratch.refinenet2 = FeatureFusionBlock(features)
36
+ self.scratch.refinenet1 = FeatureFusionBlock(features)
37
+
38
+ self.scratch.output_conv = nn.Sequential(
39
+ nn.Conv2d(features, 128, kernel_size=3, stride=1, padding=1),
40
+ Interpolate(scale_factor=2, mode="bilinear"),
41
+ nn.Conv2d(128, 32, kernel_size=3, stride=1, padding=1),
42
+ nn.ReLU(True),
43
+ nn.Conv2d(32, 1, kernel_size=1, stride=1, padding=0),
44
+ nn.ReLU(True) if non_negative else nn.Identity(),
45
+ )
46
+
47
+ if path:
48
+ self.load(path)
49
+
50
+ def forward(self, x):
51
+ """Forward pass.
52
+
53
+ Args:
54
+ x (tensor): input data (image)
55
+
56
+ Returns:
57
+ tensor: depth
58
+ """
59
+
60
+ layer_1 = self.pretrained.layer1(x)
61
+ layer_2 = self.pretrained.layer2(layer_1)
62
+ layer_3 = self.pretrained.layer3(layer_2)
63
+ layer_4 = self.pretrained.layer4(layer_3)
64
+
65
+ layer_1_rn = self.scratch.layer1_rn(layer_1)
66
+ layer_2_rn = self.scratch.layer2_rn(layer_2)
67
+ layer_3_rn = self.scratch.layer3_rn(layer_3)
68
+ layer_4_rn = self.scratch.layer4_rn(layer_4)
69
+
70
+ path_4 = self.scratch.refinenet4(layer_4_rn)
71
+ path_3 = self.scratch.refinenet3(path_4, layer_3_rn)
72
+ path_2 = self.scratch.refinenet2(path_3, layer_2_rn)
73
+ path_1 = self.scratch.refinenet1(path_2, layer_1_rn)
74
+
75
+ out = self.scratch.output_conv(path_1)
76
+
77
+ return torch.squeeze(out, dim=1)
dpt/models.py ADDED
@@ -0,0 +1,153 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+
5
+ from .base_model import BaseModel
6
+ from .blocks import (
7
+ FeatureFusionBlock,
8
+ FeatureFusionBlock_custom,
9
+ Interpolate,
10
+ _make_encoder,
11
+ forward_vit,
12
+ )
13
+
14
+
15
+ def _make_fusion_block(features, use_bn):
16
+ return FeatureFusionBlock_custom(
17
+ features,
18
+ nn.ReLU(False),
19
+ deconv=False,
20
+ bn=use_bn,
21
+ expand=False,
22
+ align_corners=True,
23
+ )
24
+
25
+
26
+ class DPT(BaseModel):
27
+ def __init__(
28
+ self,
29
+ head,
30
+ features=256,
31
+ backbone="vitb_rn50_384",
32
+ readout="project",
33
+ channels_last=False,
34
+ use_bn=False,
35
+ enable_attention_hooks=False,
36
+ ):
37
+
38
+ super(DPT, self).__init__()
39
+
40
+ self.channels_last = channels_last
41
+
42
+ hooks = {
43
+ "vitb_rn50_384": [0, 1, 8, 11],
44
+ "vitb16_384": [2, 5, 8, 11],
45
+ "vitl16_384": [5, 11, 17, 23],
46
+ }
47
+
48
+ # Instantiate backbone and reassemble blocks
49
+ self.pretrained, self.scratch = _make_encoder(
50
+ backbone,
51
+ features,
52
+ False, # Set to true of you want to train from scratch, uses ImageNet weights
53
+ groups=1,
54
+ expand=False,
55
+ exportable=False,
56
+ hooks=hooks[backbone],
57
+ use_readout=readout,
58
+ enable_attention_hooks=enable_attention_hooks,
59
+ )
60
+
61
+ self.scratch.refinenet1 = _make_fusion_block(features, use_bn)
62
+ self.scratch.refinenet2 = _make_fusion_block(features, use_bn)
63
+ self.scratch.refinenet3 = _make_fusion_block(features, use_bn)
64
+ self.scratch.refinenet4 = _make_fusion_block(features, use_bn)
65
+
66
+ self.scratch.output_conv = head
67
+
68
+ def forward(self, x):
69
+ if self.channels_last == True:
70
+ x.contiguous(memory_format=torch.channels_last)
71
+
72
+ layer_1, layer_2, layer_3, layer_4 = forward_vit(self.pretrained, x)
73
+
74
+ layer_1_rn = self.scratch.layer1_rn(layer_1)
75
+ layer_2_rn = self.scratch.layer2_rn(layer_2)
76
+ layer_3_rn = self.scratch.layer3_rn(layer_3)
77
+ layer_4_rn = self.scratch.layer4_rn(layer_4)
78
+
79
+ path_4 = self.scratch.refinenet4(layer_4_rn)
80
+ path_3 = self.scratch.refinenet3(path_4, layer_3_rn)
81
+ path_2 = self.scratch.refinenet2(path_3, layer_2_rn)
82
+ path_1 = self.scratch.refinenet1(path_2, layer_1_rn)
83
+
84
+ out = self.scratch.output_conv(path_1)
85
+
86
+ return out
87
+
88
+
89
+ class DPTDepthModel(DPT):
90
+ def __init__(
91
+ self, path=None, non_negative=True, scale=1.0, shift=0.0, invert=False, **kwargs
92
+ ):
93
+ features = kwargs["features"] if "features" in kwargs else 256
94
+
95
+ self.scale = scale
96
+ self.shift = shift
97
+ self.invert = invert
98
+
99
+ head = nn.Sequential(
100
+ nn.Conv2d(features, features // 2, kernel_size=3, stride=1, padding=1),
101
+ Interpolate(scale_factor=2, mode="bilinear", align_corners=True),
102
+ nn.Conv2d(features // 2, 32, kernel_size=3, stride=1, padding=1),
103
+ nn.ReLU(True),
104
+ nn.Conv2d(32, 1, kernel_size=1, stride=1, padding=0),
105
+ nn.ReLU(True) if non_negative else nn.Identity(),
106
+ nn.Identity(),
107
+ )
108
+
109
+ super().__init__(head, **kwargs)
110
+
111
+ if path is not None:
112
+ self.load(path)
113
+
114
+ def forward(self, x):
115
+ inv_depth = super().forward(x).squeeze(dim=1)
116
+
117
+ if self.invert:
118
+ depth = self.scale * inv_depth + self.shift
119
+ depth[depth < 1e-8] = 1e-8
120
+ depth = 1.0 / depth
121
+ return depth
122
+ else:
123
+ return inv_depth
124
+
125
+
126
+ class DPTSegmentationModel(DPT):
127
+ def __init__(self, num_classes, path=None, **kwargs):
128
+
129
+ features = kwargs["features"] if "features" in kwargs else 256
130
+
131
+ kwargs["use_bn"] = True
132
+
133
+ head = nn.Sequential(
134
+ nn.Conv2d(features, features, kernel_size=3, padding=1, bias=False),
135
+ nn.BatchNorm2d(features),
136
+ nn.ReLU(True),
137
+ nn.Dropout(0.1, False),
138
+ nn.Conv2d(features, num_classes, kernel_size=1),
139
+ Interpolate(scale_factor=2, mode="bilinear", align_corners=True),
140
+ )
141
+
142
+ super().__init__(head, **kwargs)
143
+
144
+ self.auxlayer = nn.Sequential(
145
+ nn.Conv2d(features, features, kernel_size=3, padding=1, bias=False),
146
+ nn.BatchNorm2d(features),
147
+ nn.ReLU(True),
148
+ nn.Dropout(0.1, False),
149
+ nn.Conv2d(features, num_classes, kernel_size=1),
150
+ )
151
+
152
+ if path is not None:
153
+ self.load(path)
dpt/transforms.py ADDED
@@ -0,0 +1,231 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import numpy as np
2
+ import cv2
3
+ import math
4
+
5
+
6
+ def apply_min_size(sample, size, image_interpolation_method=cv2.INTER_AREA):
7
+ """Rezise the sample to ensure the given size. Keeps aspect ratio.
8
+
9
+ Args:
10
+ sample (dict): sample
11
+ size (tuple): image size
12
+
13
+ Returns:
14
+ tuple: new size
15
+ """
16
+ shape = list(sample["disparity"].shape)
17
+
18
+ if shape[0] >= size[0] and shape[1] >= size[1]:
19
+ return sample
20
+
21
+ scale = [0, 0]
22
+ scale[0] = size[0] / shape[0]
23
+ scale[1] = size[1] / shape[1]
24
+
25
+ scale = max(scale)
26
+
27
+ shape[0] = math.ceil(scale * shape[0])
28
+ shape[1] = math.ceil(scale * shape[1])
29
+
30
+ # resize
31
+ sample["image"] = cv2.resize(
32
+ sample["image"], tuple(shape[::-1]), interpolation=image_interpolation_method
33
+ )
34
+
35
+ sample["disparity"] = cv2.resize(
36
+ sample["disparity"], tuple(shape[::-1]), interpolation=cv2.INTER_NEAREST
37
+ )
38
+ sample["mask"] = cv2.resize(
39
+ sample["mask"].astype(np.float32),
40
+ tuple(shape[::-1]),
41
+ interpolation=cv2.INTER_NEAREST,
42
+ )
43
+ sample["mask"] = sample["mask"].astype(bool)
44
+
45
+ return tuple(shape)
46
+
47
+
48
+ class Resize(object):
49
+ """Resize sample to given size (width, height)."""
50
+
51
+ def __init__(
52
+ self,
53
+ width,
54
+ height,
55
+ resize_target=True,
56
+ keep_aspect_ratio=False,
57
+ ensure_multiple_of=1,
58
+ resize_method="lower_bound",
59
+ image_interpolation_method=cv2.INTER_AREA,
60
+ ):
61
+ """Init.
62
+
63
+ Args:
64
+ width (int): desired output width
65
+ height (int): desired output height
66
+ resize_target (bool, optional):
67
+ True: Resize the full sample (image, mask, target).
68
+ False: Resize image only.
69
+ Defaults to True.
70
+ keep_aspect_ratio (bool, optional):
71
+ True: Keep the aspect ratio of the input sample.
72
+ Output sample might not have the given width and height, and
73
+ resize behaviour depends on the parameter 'resize_method'.
74
+ Defaults to False.
75
+ ensure_multiple_of (int, optional):
76
+ Output width and height is constrained to be multiple of this parameter.
77
+ Defaults to 1.
78
+ resize_method (str, optional):
79
+ "lower_bound": Output will be at least as large as the given size.
80
+ "upper_bound": Output will be at max as large as the given size. (Output size might be smaller than given size.)
81
+ "minimal": Scale as least as possible. (Output size might be smaller than given size.)
82
+ Defaults to "lower_bound".
83
+ """
84
+ self.__width = width
85
+ self.__height = height
86
+
87
+ self.__resize_target = resize_target
88
+ self.__keep_aspect_ratio = keep_aspect_ratio
89
+ self.__multiple_of = ensure_multiple_of
90
+ self.__resize_method = resize_method
91
+ self.__image_interpolation_method = image_interpolation_method
92
+
93
+ def constrain_to_multiple_of(self, x, min_val=0, max_val=None):
94
+ y = (np.round(x / self.__multiple_of) * self.__multiple_of).astype(int)
95
+
96
+ if max_val is not None and y > max_val:
97
+ y = (np.floor(x / self.__multiple_of) * self.__multiple_of).astype(int)
98
+
99
+ if y < min_val:
100
+ y = (np.ceil(x / self.__multiple_of) * self.__multiple_of).astype(int)
101
+
102
+ return y
103
+
104
+ def get_size(self, width, height):
105
+ # determine new height and width
106
+ scale_height = self.__height / height
107
+ scale_width = self.__width / width
108
+
109
+ if self.__keep_aspect_ratio:
110
+ if self.__resize_method == "lower_bound":
111
+ # scale such that output size is lower bound
112
+ if scale_width > scale_height:
113
+ # fit width
114
+ scale_height = scale_width
115
+ else:
116
+ # fit height
117
+ scale_width = scale_height
118
+ elif self.__resize_method == "upper_bound":
119
+ # scale such that output size is upper bound
120
+ if scale_width < scale_height:
121
+ # fit width
122
+ scale_height = scale_width
123
+ else:
124
+ # fit height
125
+ scale_width = scale_height
126
+ elif self.__resize_method == "minimal":
127
+ # scale as least as possbile
128
+ if abs(1 - scale_width) < abs(1 - scale_height):
129
+ # fit width
130
+ scale_height = scale_width
131
+ else:
132
+ # fit height
133
+ scale_width = scale_height
134
+ else:
135
+ raise ValueError(
136
+ f"resize_method {self.__resize_method} not implemented"
137
+ )
138
+
139
+ if self.__resize_method == "lower_bound":
140
+ new_height = self.constrain_to_multiple_of(
141
+ scale_height * height, min_val=self.__height
142
+ )
143
+ new_width = self.constrain_to_multiple_of(
144
+ scale_width * width, min_val=self.__width
145
+ )
146
+ elif self.__resize_method == "upper_bound":
147
+ new_height = self.constrain_to_multiple_of(
148
+ scale_height * height, max_val=self.__height
149
+ )
150
+ new_width = self.constrain_to_multiple_of(
151
+ scale_width * width, max_val=self.__width
152
+ )
153
+ elif self.__resize_method == "minimal":
154
+ new_height = self.constrain_to_multiple_of(scale_height * height)
155
+ new_width = self.constrain_to_multiple_of(scale_width * width)
156
+ else:
157
+ raise ValueError(f"resize_method {self.__resize_method} not implemented")
158
+
159
+ return (new_width, new_height)
160
+
161
+ def __call__(self, sample):
162
+ width, height = self.get_size(
163
+ sample["image"].shape[1], sample["image"].shape[0]
164
+ )
165
+
166
+ # resize sample
167
+ sample["image"] = cv2.resize(
168
+ sample["image"],
169
+ (width, height),
170
+ interpolation=self.__image_interpolation_method,
171
+ )
172
+
173
+ if self.__resize_target:
174
+ if "disparity" in sample:
175
+ sample["disparity"] = cv2.resize(
176
+ sample["disparity"],
177
+ (width, height),
178
+ interpolation=cv2.INTER_NEAREST,
179
+ )
180
+
181
+ if "depth" in sample:
182
+ sample["depth"] = cv2.resize(
183
+ sample["depth"], (width, height), interpolation=cv2.INTER_NEAREST
184
+ )
185
+
186
+ sample["mask"] = cv2.resize(
187
+ sample["mask"].astype(np.float32),
188
+ (width, height),
189
+ interpolation=cv2.INTER_NEAREST,
190
+ )
191
+ sample["mask"] = sample["mask"].astype(bool)
192
+
193
+ return sample
194
+
195
+
196
+ class NormalizeImage(object):
197
+ """Normlize image by given mean and std."""
198
+
199
+ def __init__(self, mean, std):
200
+ self.__mean = mean
201
+ self.__std = std
202
+
203
+ def __call__(self, sample):
204
+ sample["image"] = (sample["image"] - self.__mean) / self.__std
205
+
206
+ return sample
207
+
208
+
209
+ class PrepareForNet(object):
210
+ """Prepare sample for usage as network input."""
211
+
212
+ def __init__(self):
213
+ pass
214
+
215
+ def __call__(self, sample):
216
+ image = np.transpose(sample["image"], (2, 0, 1))
217
+ sample["image"] = np.ascontiguousarray(image).astype(np.float32)
218
+
219
+ if "mask" in sample:
220
+ sample["mask"] = sample["mask"].astype(np.float32)
221
+ sample["mask"] = np.ascontiguousarray(sample["mask"])
222
+
223
+ if "disparity" in sample:
224
+ disparity = sample["disparity"].astype(np.float32)
225
+ sample["disparity"] = np.ascontiguousarray(disparity)
226
+
227
+ if "depth" in sample:
228
+ depth = sample["depth"].astype(np.float32)
229
+ sample["depth"] = np.ascontiguousarray(depth)
230
+
231
+ return sample
dpt/vit.py ADDED
@@ -0,0 +1,576 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+ import torch.nn as nn
3
+ import timm
4
+ import types
5
+ import math
6
+ import torch.nn.functional as F
7
+
8
+
9
+ activations = {}
10
+
11
+
12
+ def get_activation(name):
13
+ def hook(model, input, output):
14
+ activations[name] = output
15
+
16
+ return hook
17
+
18
+
19
+ attention = {}
20
+
21
+
22
+ def get_attention(name):
23
+ def hook(module, input, output):
24
+ x = input[0]
25
+ B, N, C = x.shape
26
+ qkv = (
27
+ module.qkv(x)
28
+ .reshape(B, N, 3, module.num_heads, C // module.num_heads)
29
+ .permute(2, 0, 3, 1, 4)
30
+ )
31
+ q, k, v = (
32
+ qkv[0],
33
+ qkv[1],
34
+ qkv[2],
35
+ ) # make torchscript happy (cannot use tensor as tuple)
36
+
37
+ attn = (q @ k.transpose(-2, -1)) * module.scale
38
+
39
+ attn = attn.softmax(dim=-1) # [:,:,1,1:]
40
+ attention[name] = attn
41
+
42
+ return hook
43
+
44
+
45
+ def get_mean_attention_map(attn, token, shape):
46
+ attn = attn[:, :, token, 1:]
47
+ attn = attn.unflatten(2, torch.Size([shape[2] // 16, shape[3] // 16])).float()
48
+ attn = torch.nn.functional.interpolate(
49
+ attn, size=shape[2:], mode="bicubic", align_corners=False
50
+ ).squeeze(0)
51
+
52
+ all_attn = torch.mean(attn, 0)
53
+
54
+ return all_attn
55
+
56
+
57
+ class Slice(nn.Module):
58
+ def __init__(self, start_index=1):
59
+ super(Slice, self).__init__()
60
+ self.start_index = start_index
61
+
62
+ def forward(self, x):
63
+ return x[:, self.start_index :]
64
+
65
+
66
+ class AddReadout(nn.Module):
67
+ def __init__(self, start_index=1):
68
+ super(AddReadout, self).__init__()
69
+ self.start_index = start_index
70
+
71
+ def forward(self, x):
72
+ if self.start_index == 2:
73
+ readout = (x[:, 0] + x[:, 1]) / 2
74
+ else:
75
+ readout = x[:, 0]
76
+ return x[:, self.start_index :] + readout.unsqueeze(1)
77
+
78
+
79
+ class ProjectReadout(nn.Module):
80
+ def __init__(self, in_features, start_index=1):
81
+ super(ProjectReadout, self).__init__()
82
+ self.start_index = start_index
83
+
84
+ self.project = nn.Sequential(nn.Linear(2 * in_features, in_features), nn.GELU())
85
+
86
+ def forward(self, x):
87
+ readout = x[:, 0].unsqueeze(1).expand_as(x[:, self.start_index :])
88
+ features = torch.cat((x[:, self.start_index :], readout), -1)
89
+
90
+ return self.project(features)
91
+
92
+
93
+ class Transpose(nn.Module):
94
+ def __init__(self, dim0, dim1):
95
+ super(Transpose, self).__init__()
96
+ self.dim0 = dim0
97
+ self.dim1 = dim1
98
+
99
+ def forward(self, x):
100
+ x = x.transpose(self.dim0, self.dim1)
101
+ return x
102
+
103
+
104
+ def forward_vit(pretrained, x):
105
+ b, c, h, w = x.shape
106
+
107
+ glob = pretrained.model.forward_flex(x)
108
+
109
+ layer_1 = pretrained.activations["1"]
110
+ layer_2 = pretrained.activations["2"]
111
+ layer_3 = pretrained.activations["3"]
112
+ layer_4 = pretrained.activations["4"]
113
+
114
+ layer_1 = pretrained.act_postprocess1[0:2](layer_1)
115
+ layer_2 = pretrained.act_postprocess2[0:2](layer_2)
116
+ layer_3 = pretrained.act_postprocess3[0:2](layer_3)
117
+ layer_4 = pretrained.act_postprocess4[0:2](layer_4)
118
+
119
+ unflatten = nn.Sequential(
120
+ nn.Unflatten(
121
+ 2,
122
+ torch.Size(
123
+ [
124
+ h // pretrained.model.patch_size[1],
125
+ w // pretrained.model.patch_size[0],
126
+ ]
127
+ ),
128
+ )
129
+ )
130
+
131
+ if layer_1.ndim == 3:
132
+ layer_1 = unflatten(layer_1)
133
+ if layer_2.ndim == 3:
134
+ layer_2 = unflatten(layer_2)
135
+ if layer_3.ndim == 3:
136
+ layer_3 = unflatten(layer_3)
137
+ if layer_4.ndim == 3:
138
+ layer_4 = unflatten(layer_4)
139
+
140
+ layer_1 = pretrained.act_postprocess1[3 : len(pretrained.act_postprocess1)](layer_1)
141
+ layer_2 = pretrained.act_postprocess2[3 : len(pretrained.act_postprocess2)](layer_2)
142
+ layer_3 = pretrained.act_postprocess3[3 : len(pretrained.act_postprocess3)](layer_3)
143
+ layer_4 = pretrained.act_postprocess4[3 : len(pretrained.act_postprocess4)](layer_4)
144
+
145
+ return layer_1, layer_2, layer_3, layer_4
146
+
147
+
148
+ def _resize_pos_embed(self, posemb, gs_h, gs_w):
149
+ posemb_tok, posemb_grid = (
150
+ posemb[:, : self.start_index],
151
+ posemb[0, self.start_index :],
152
+ )
153
+
154
+ gs_old = int(math.sqrt(len(posemb_grid)))
155
+
156
+ posemb_grid = posemb_grid.reshape(1, gs_old, gs_old, -1).permute(0, 3, 1, 2)
157
+ posemb_grid = F.interpolate(posemb_grid, size=(gs_h, gs_w), mode="bilinear")
158
+ posemb_grid = posemb_grid.permute(0, 2, 3, 1).reshape(1, gs_h * gs_w, -1)
159
+
160
+ posemb = torch.cat([posemb_tok, posemb_grid], dim=1)
161
+
162
+ return posemb
163
+
164
+
165
+ def forward_flex(self, x):
166
+ b, c, h, w = x.shape
167
+
168
+ pos_embed = self._resize_pos_embed(
169
+ self.pos_embed, h // self.patch_size[1], w // self.patch_size[0]
170
+ )
171
+
172
+ B = x.shape[0]
173
+
174
+ if hasattr(self.patch_embed, "backbone"):
175
+ x = self.patch_embed.backbone(x)
176
+ if isinstance(x, (list, tuple)):
177
+ x = x[-1] # last feature if backbone outputs list/tuple of features
178
+
179
+ x = self.patch_embed.proj(x).flatten(2).transpose(1, 2)
180
+
181
+ if getattr(self, "dist_token", None) is not None:
182
+ cls_tokens = self.cls_token.expand(
183
+ B, -1, -1
184
+ ) # stole cls_tokens impl from Phil Wang, thanks
185
+ dist_token = self.dist_token.expand(B, -1, -1)
186
+ x = torch.cat((cls_tokens, dist_token, x), dim=1)
187
+ else:
188
+ cls_tokens = self.cls_token.expand(
189
+ B, -1, -1
190
+ ) # stole cls_tokens impl from Phil Wang, thanks
191
+ x = torch.cat((cls_tokens, x), dim=1)
192
+
193
+ x = x + pos_embed
194
+ x = self.pos_drop(x)
195
+
196
+ for blk in self.blocks:
197
+ x = blk(x)
198
+
199
+ x = self.norm(x)
200
+
201
+ return x
202
+
203
+
204
+ def get_readout_oper(vit_features, features, use_readout, start_index=1):
205
+ if use_readout == "ignore":
206
+ readout_oper = [Slice(start_index)] * len(features)
207
+ elif use_readout == "add":
208
+ readout_oper = [AddReadout(start_index)] * len(features)
209
+ elif use_readout == "project":
210
+ readout_oper = [
211
+ ProjectReadout(vit_features, start_index) for out_feat in features
212
+ ]
213
+ else:
214
+ assert (
215
+ False
216
+ ), "wrong operation for readout token, use_readout can be 'ignore', 'add', or 'project'"
217
+
218
+ return readout_oper
219
+
220
+
221
+ def _make_vit_b16_backbone(
222
+ model,
223
+ features=[96, 192, 384, 768],
224
+ size=[384, 384],
225
+ hooks=[2, 5, 8, 11],
226
+ vit_features=768,
227
+ use_readout="ignore",
228
+ start_index=1,
229
+ enable_attention_hooks=False,
230
+ ):
231
+ pretrained = nn.Module()
232
+
233
+ pretrained.model = model
234
+ pretrained.model.blocks[hooks[0]].register_forward_hook(get_activation("1"))
235
+ pretrained.model.blocks[hooks[1]].register_forward_hook(get_activation("2"))
236
+ pretrained.model.blocks[hooks[2]].register_forward_hook(get_activation("3"))
237
+ pretrained.model.blocks[hooks[3]].register_forward_hook(get_activation("4"))
238
+
239
+ pretrained.activations = activations
240
+
241
+ if enable_attention_hooks:
242
+ pretrained.model.blocks[hooks[0]].attn.register_forward_hook(
243
+ get_attention("attn_1")
244
+ )
245
+ pretrained.model.blocks[hooks[1]].attn.register_forward_hook(
246
+ get_attention("attn_2")
247
+ )
248
+ pretrained.model.blocks[hooks[2]].attn.register_forward_hook(
249
+ get_attention("attn_3")
250
+ )
251
+ pretrained.model.blocks[hooks[3]].attn.register_forward_hook(
252
+ get_attention("attn_4")
253
+ )
254
+ pretrained.attention = attention
255
+
256
+ readout_oper = get_readout_oper(vit_features, features, use_readout, start_index)
257
+
258
+ # 32, 48, 136, 384
259
+ pretrained.act_postprocess1 = nn.Sequential(
260
+ readout_oper[0],
261
+ Transpose(1, 2),
262
+ nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])),
263
+ nn.Conv2d(
264
+ in_channels=vit_features,
265
+ out_channels=features[0],
266
+ kernel_size=1,
267
+ stride=1,
268
+ padding=0,
269
+ ),
270
+ nn.ConvTranspose2d(
271
+ in_channels=features[0],
272
+ out_channels=features[0],
273
+ kernel_size=4,
274
+ stride=4,
275
+ padding=0,
276
+ bias=True,
277
+ dilation=1,
278
+ groups=1,
279
+ ),
280
+ )
281
+
282
+ pretrained.act_postprocess2 = nn.Sequential(
283
+ readout_oper[1],
284
+ Transpose(1, 2),
285
+ nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])),
286
+ nn.Conv2d(
287
+ in_channels=vit_features,
288
+ out_channels=features[1],
289
+ kernel_size=1,
290
+ stride=1,
291
+ padding=0,
292
+ ),
293
+ nn.ConvTranspose2d(
294
+ in_channels=features[1],
295
+ out_channels=features[1],
296
+ kernel_size=2,
297
+ stride=2,
298
+ padding=0,
299
+ bias=True,
300
+ dilation=1,
301
+ groups=1,
302
+ ),
303
+ )
304
+
305
+ pretrained.act_postprocess3 = nn.Sequential(
306
+ readout_oper[2],
307
+ Transpose(1, 2),
308
+ nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])),
309
+ nn.Conv2d(
310
+ in_channels=vit_features,
311
+ out_channels=features[2],
312
+ kernel_size=1,
313
+ stride=1,
314
+ padding=0,
315
+ ),
316
+ )
317
+
318
+ pretrained.act_postprocess4 = nn.Sequential(
319
+ readout_oper[3],
320
+ Transpose(1, 2),
321
+ nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])),
322
+ nn.Conv2d(
323
+ in_channels=vit_features,
324
+ out_channels=features[3],
325
+ kernel_size=1,
326
+ stride=1,
327
+ padding=0,
328
+ ),
329
+ nn.Conv2d(
330
+ in_channels=features[3],
331
+ out_channels=features[3],
332
+ kernel_size=3,
333
+ stride=2,
334
+ padding=1,
335
+ ),
336
+ )
337
+
338
+ pretrained.model.start_index = start_index
339
+ pretrained.model.patch_size = [16, 16]
340
+
341
+ # We inject this function into the VisionTransformer instances so that
342
+ # we can use it with interpolated position embeddings without modifying the library source.
343
+ pretrained.model.forward_flex = types.MethodType(forward_flex, pretrained.model)
344
+ pretrained.model._resize_pos_embed = types.MethodType(
345
+ _resize_pos_embed, pretrained.model
346
+ )
347
+
348
+ return pretrained
349
+
350
+
351
+ def _make_vit_b_rn50_backbone(
352
+ model,
353
+ features=[256, 512, 768, 768],
354
+ size=[384, 384],
355
+ hooks=[0, 1, 8, 11],
356
+ vit_features=768,
357
+ use_vit_only=False,
358
+ use_readout="ignore",
359
+ start_index=1,
360
+ enable_attention_hooks=False,
361
+ ):
362
+ pretrained = nn.Module()
363
+
364
+ pretrained.model = model
365
+
366
+ if use_vit_only == True:
367
+ pretrained.model.blocks[hooks[0]].register_forward_hook(get_activation("1"))
368
+ pretrained.model.blocks[hooks[1]].register_forward_hook(get_activation("2"))
369
+ else:
370
+ pretrained.model.patch_embed.backbone.stages[0].register_forward_hook(
371
+ get_activation("1")
372
+ )
373
+ pretrained.model.patch_embed.backbone.stages[1].register_forward_hook(
374
+ get_activation("2")
375
+ )
376
+
377
+ pretrained.model.blocks[hooks[2]].register_forward_hook(get_activation("3"))
378
+ pretrained.model.blocks[hooks[3]].register_forward_hook(get_activation("4"))
379
+
380
+ if enable_attention_hooks:
381
+ pretrained.model.blocks[2].attn.register_forward_hook(get_attention("attn_1"))
382
+ pretrained.model.blocks[5].attn.register_forward_hook(get_attention("attn_2"))
383
+ pretrained.model.blocks[8].attn.register_forward_hook(get_attention("attn_3"))
384
+ pretrained.model.blocks[11].attn.register_forward_hook(get_attention("attn_4"))
385
+ pretrained.attention = attention
386
+
387
+ pretrained.activations = activations
388
+
389
+ readout_oper = get_readout_oper(vit_features, features, use_readout, start_index)
390
+
391
+ if use_vit_only == True:
392
+ pretrained.act_postprocess1 = nn.Sequential(
393
+ readout_oper[0],
394
+ Transpose(1, 2),
395
+ nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])),
396
+ nn.Conv2d(
397
+ in_channels=vit_features,
398
+ out_channels=features[0],
399
+ kernel_size=1,
400
+ stride=1,
401
+ padding=0,
402
+ ),
403
+ nn.ConvTranspose2d(
404
+ in_channels=features[0],
405
+ out_channels=features[0],
406
+ kernel_size=4,
407
+ stride=4,
408
+ padding=0,
409
+ bias=True,
410
+ dilation=1,
411
+ groups=1,
412
+ ),
413
+ )
414
+
415
+ pretrained.act_postprocess2 = nn.Sequential(
416
+ readout_oper[1],
417
+ Transpose(1, 2),
418
+ nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])),
419
+ nn.Conv2d(
420
+ in_channels=vit_features,
421
+ out_channels=features[1],
422
+ kernel_size=1,
423
+ stride=1,
424
+ padding=0,
425
+ ),
426
+ nn.ConvTranspose2d(
427
+ in_channels=features[1],
428
+ out_channels=features[1],
429
+ kernel_size=2,
430
+ stride=2,
431
+ padding=0,
432
+ bias=True,
433
+ dilation=1,
434
+ groups=1,
435
+ ),
436
+ )
437
+ else:
438
+ pretrained.act_postprocess1 = nn.Sequential(
439
+ nn.Identity(), nn.Identity(), nn.Identity()
440
+ )
441
+ pretrained.act_postprocess2 = nn.Sequential(
442
+ nn.Identity(), nn.Identity(), nn.Identity()
443
+ )
444
+
445
+ pretrained.act_postprocess3 = nn.Sequential(
446
+ readout_oper[2],
447
+ Transpose(1, 2),
448
+ nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])),
449
+ nn.Conv2d(
450
+ in_channels=vit_features,
451
+ out_channels=features[2],
452
+ kernel_size=1,
453
+ stride=1,
454
+ padding=0,
455
+ ),
456
+ )
457
+
458
+ pretrained.act_postprocess4 = nn.Sequential(
459
+ readout_oper[3],
460
+ Transpose(1, 2),
461
+ nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])),
462
+ nn.Conv2d(
463
+ in_channels=vit_features,
464
+ out_channels=features[3],
465
+ kernel_size=1,
466
+ stride=1,
467
+ padding=0,
468
+ ),
469
+ nn.Conv2d(
470
+ in_channels=features[3],
471
+ out_channels=features[3],
472
+ kernel_size=3,
473
+ stride=2,
474
+ padding=1,
475
+ ),
476
+ )
477
+
478
+ pretrained.model.start_index = start_index
479
+ pretrained.model.patch_size = [16, 16]
480
+
481
+ # We inject this function into the VisionTransformer instances so that
482
+ # we can use it with interpolated position embeddings without modifying the library source.
483
+ pretrained.model.forward_flex = types.MethodType(forward_flex, pretrained.model)
484
+
485
+ # We inject this function into the VisionTransformer instances so that
486
+ # we can use it with interpolated position embeddings without modifying the library source.
487
+ pretrained.model._resize_pos_embed = types.MethodType(
488
+ _resize_pos_embed, pretrained.model
489
+ )
490
+
491
+ return pretrained
492
+
493
+
494
+ def _make_pretrained_vitb_rn50_384(
495
+ pretrained,
496
+ use_readout="ignore",
497
+ hooks=None,
498
+ use_vit_only=False,
499
+ enable_attention_hooks=False,
500
+ ):
501
+ model = timm.create_model("vit_base_resnet50_384", pretrained=pretrained)
502
+
503
+ hooks = [0, 1, 8, 11] if hooks == None else hooks
504
+ return _make_vit_b_rn50_backbone(
505
+ model,
506
+ features=[256, 512, 768, 768],
507
+ size=[384, 384],
508
+ hooks=hooks,
509
+ use_vit_only=use_vit_only,
510
+ use_readout=use_readout,
511
+ enable_attention_hooks=enable_attention_hooks,
512
+ )
513
+
514
+
515
+ def _make_pretrained_vitl16_384(
516
+ pretrained, use_readout="ignore", hooks=None, enable_attention_hooks=False
517
+ ):
518
+ model = timm.create_model("vit_large_patch16_384", pretrained=pretrained)
519
+
520
+ hooks = [5, 11, 17, 23] if hooks == None else hooks
521
+ return _make_vit_b16_backbone(
522
+ model,
523
+ features=[256, 512, 1024, 1024],
524
+ hooks=hooks,
525
+ vit_features=1024,
526
+ use_readout=use_readout,
527
+ enable_attention_hooks=enable_attention_hooks,
528
+ )
529
+
530
+
531
+ def _make_pretrained_vitb16_384(
532
+ pretrained, use_readout="ignore", hooks=None, enable_attention_hooks=False
533
+ ):
534
+ model = timm.create_model("vit_base_patch16_384", pretrained=pretrained)
535
+
536
+ hooks = [2, 5, 8, 11] if hooks == None else hooks
537
+ return _make_vit_b16_backbone(
538
+ model,
539
+ features=[96, 192, 384, 768],
540
+ hooks=hooks,
541
+ use_readout=use_readout,
542
+ enable_attention_hooks=enable_attention_hooks,
543
+ )
544
+
545
+
546
+ def _make_pretrained_deitb16_384(
547
+ pretrained, use_readout="ignore", hooks=None, enable_attention_hooks=False
548
+ ):
549
+ model = timm.create_model("vit_deit_base_patch16_384", pretrained=pretrained)
550
+
551
+ hooks = [2, 5, 8, 11] if hooks == None else hooks
552
+ return _make_vit_b16_backbone(
553
+ model,
554
+ features=[96, 192, 384, 768],
555
+ hooks=hooks,
556
+ use_readout=use_readout,
557
+ enable_attention_hooks=enable_attention_hooks,
558
+ )
559
+
560
+
561
+ def _make_pretrained_deitb16_distil_384(
562
+ pretrained, use_readout="ignore", hooks=None, enable_attention_hooks=False
563
+ ):
564
+ model = timm.create_model(
565
+ "vit_deit_base_distilled_patch16_384", pretrained=pretrained
566
+ )
567
+
568
+ hooks = [2, 5, 8, 11] if hooks == None else hooks
569
+ return _make_vit_b16_backbone(
570
+ model,
571
+ features=[96, 192, 384, 768],
572
+ hooks=hooks,
573
+ use_readout=use_readout,
574
+ start_index=2,
575
+ enable_attention_hooks=enable_attention_hooks,
576
+ )
mask2former/__init__.py ADDED
@@ -0,0 +1,26 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (c) Facebook, Inc. and its affiliates.
2
+ from . import data # register all new datasets
3
+ from . import modeling
4
+
5
+ # config
6
+ from .config import add_maskformer2_config
7
+
8
+ # dataset loading
9
+ from .data.dataset_mappers.coco_instance_new_baseline_dataset_mapper import COCOInstanceNewBaselineDatasetMapper
10
+ from .data.dataset_mappers.coco_panoptic_new_baseline_dataset_mapper import COCOPanopticNewBaselineDatasetMapper
11
+ from .data.dataset_mappers.mask_former_instance_dataset_mapper import (
12
+ MaskFormerInstanceDatasetMapper,
13
+ )
14
+ from .data.dataset_mappers.mask_former_panoptic_dataset_mapper import (
15
+ MaskFormerPanopticDatasetMapper,
16
+ )
17
+ from .data.dataset_mappers.mask_former_semantic_dataset_mapper import (
18
+ MaskFormerSemanticDatasetMapper,
19
+ )
20
+
21
+ # models
22
+ from .maskformer_model import MaskFormer
23
+ from .test_time_augmentation import SemanticSegmentorWithTTA
24
+
25
+ # evaluation
26
+ from .evaluation.instance_evaluation import InstanceSegEvaluator
mask2former/config.py ADDED
@@ -0,0 +1,114 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Facebook, Inc. and its affiliates.
3
+ from detectron2.config import CfgNode as CN
4
+
5
+
6
+ def add_maskformer2_config(cfg):
7
+ """
8
+ Add config for MASK_FORMER.
9
+ """
10
+ # NOTE: configs from original maskformer
11
+ # data config
12
+ # select the dataset mapper
13
+ cfg.INPUT.DATASET_MAPPER_NAME = "mask_former_semantic"
14
+ # Color augmentation
15
+ cfg.INPUT.COLOR_AUG_SSD = False
16
+ # We retry random cropping until no single category in semantic segmentation GT occupies more
17
+ # than `SINGLE_CATEGORY_MAX_AREA` part of the crop.
18
+ cfg.INPUT.CROP.SINGLE_CATEGORY_MAX_AREA = 1.0
19
+ # Pad image and segmentation GT in dataset mapper.
20
+ cfg.INPUT.SIZE_DIVISIBILITY = -1
21
+
22
+ # solver config
23
+ # weight decay on embedding
24
+ cfg.SOLVER.WEIGHT_DECAY_EMBED = 0.0
25
+ # optimizer
26
+ cfg.SOLVER.OPTIMIZER = "ADAMW"
27
+ cfg.SOLVER.BACKBONE_MULTIPLIER = 0.1
28
+
29
+ # mask_former model config
30
+ cfg.MODEL.MASK_FORMER = CN()
31
+
32
+ # loss
33
+ cfg.MODEL.MASK_FORMER.DEEP_SUPERVISION = True
34
+ cfg.MODEL.MASK_FORMER.NO_OBJECT_WEIGHT = 0.1
35
+ cfg.MODEL.MASK_FORMER.CLASS_WEIGHT = 1.0
36
+ cfg.MODEL.MASK_FORMER.DICE_WEIGHT = 1.0
37
+ cfg.MODEL.MASK_FORMER.MASK_WEIGHT = 20.0
38
+
39
+ # transformer config
40
+ cfg.MODEL.MASK_FORMER.NHEADS = 8
41
+ cfg.MODEL.MASK_FORMER.DROPOUT = 0.1
42
+ cfg.MODEL.MASK_FORMER.DIM_FEEDFORWARD = 2048
43
+ cfg.MODEL.MASK_FORMER.ENC_LAYERS = 0
44
+ cfg.MODEL.MASK_FORMER.DEC_LAYERS = 6
45
+ cfg.MODEL.MASK_FORMER.PRE_NORM = False
46
+
47
+ cfg.MODEL.MASK_FORMER.HIDDEN_DIM = 256
48
+ cfg.MODEL.MASK_FORMER.NUM_OBJECT_QUERIES = 100
49
+
50
+ cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE = "res5"
51
+ cfg.MODEL.MASK_FORMER.ENFORCE_INPUT_PROJ = False
52
+
53
+ # mask_former inference config
54
+ cfg.MODEL.MASK_FORMER.TEST = CN()
55
+ cfg.MODEL.MASK_FORMER.TEST.SEMANTIC_ON = True
56
+ cfg.MODEL.MASK_FORMER.TEST.INSTANCE_ON = False
57
+ cfg.MODEL.MASK_FORMER.TEST.PANOPTIC_ON = False
58
+ cfg.MODEL.MASK_FORMER.TEST.OBJECT_MASK_THRESHOLD = 0.0
59
+ cfg.MODEL.MASK_FORMER.TEST.OVERLAP_THRESHOLD = 0.0
60
+ cfg.MODEL.MASK_FORMER.TEST.SEM_SEG_POSTPROCESSING_BEFORE_INFERENCE = False
61
+
62
+ # Sometimes `backbone.size_divisibility` is set to 0 for some backbone (e.g. ResNet)
63
+ # you can use this config to override
64
+ cfg.MODEL.MASK_FORMER.SIZE_DIVISIBILITY = 32
65
+
66
+ # pixel decoder config
67
+ cfg.MODEL.SEM_SEG_HEAD.MASK_DIM = 256
68
+ # adding transformer in pixel decoder
69
+ cfg.MODEL.SEM_SEG_HEAD.TRANSFORMER_ENC_LAYERS = 0
70
+ # pixel decoder
71
+ cfg.MODEL.SEM_SEG_HEAD.PIXEL_DECODER_NAME = "BasePixelDecoder"
72
+
73
+ # swin transformer backbone
74
+ cfg.MODEL.SWIN = CN()
75
+ cfg.MODEL.SWIN.PRETRAIN_IMG_SIZE = 224
76
+ cfg.MODEL.SWIN.PATCH_SIZE = 4
77
+ cfg.MODEL.SWIN.EMBED_DIM = 96
78
+ cfg.MODEL.SWIN.DEPTHS = [2, 2, 6, 2]
79
+ cfg.MODEL.SWIN.NUM_HEADS = [3, 6, 12, 24]
80
+ cfg.MODEL.SWIN.WINDOW_SIZE = 7
81
+ cfg.MODEL.SWIN.MLP_RATIO = 4.0
82
+ cfg.MODEL.SWIN.QKV_BIAS = True
83
+ cfg.MODEL.SWIN.QK_SCALE = None
84
+ cfg.MODEL.SWIN.DROP_RATE = 0.0
85
+ cfg.MODEL.SWIN.ATTN_DROP_RATE = 0.0
86
+ cfg.MODEL.SWIN.DROP_PATH_RATE = 0.3
87
+ cfg.MODEL.SWIN.APE = False
88
+ cfg.MODEL.SWIN.PATCH_NORM = True
89
+ cfg.MODEL.SWIN.OUT_FEATURES = ["res2", "res3", "res4", "res5"]
90
+ cfg.MODEL.SWIN.USE_CHECKPOINT = False
91
+
92
+ # NOTE: maskformer2 extra configs
93
+ # transformer module
94
+ cfg.MODEL.MASK_FORMER.TRANSFORMER_DECODER_NAME = "MultiScaleMaskedTransformerDecoder"
95
+
96
+ # LSJ aug
97
+ cfg.INPUT.IMAGE_SIZE = 1024
98
+ cfg.INPUT.MIN_SCALE = 0.1
99
+ cfg.INPUT.MAX_SCALE = 2.0
100
+
101
+ # MSDeformAttn encoder configs
102
+ cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES = ["res3", "res4", "res5"]
103
+ cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_N_POINTS = 4
104
+ cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_N_HEADS = 8
105
+
106
+ # point loss configs
107
+ # Number of points sampled during training for a mask point head.
108
+ cfg.MODEL.MASK_FORMER.TRAIN_NUM_POINTS = 112 * 112
109
+ # Oversampling parameter for PointRend point sampling during training. Parameter `k` in the
110
+ # original paper.
111
+ cfg.MODEL.MASK_FORMER.OVERSAMPLE_RATIO = 3.0
112
+ # Importance sampling parameter for PointRend point sampling during training. Parametr `beta` in
113
+ # the original paper.
114
+ cfg.MODEL.MASK_FORMER.IMPORTANCE_SAMPLE_RATIO = 0.75
mask2former/configs/ade20k/instance-segmentation/Base-ADE20K-InstanceSegmentation.yaml ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ MODEL:
2
+ BACKBONE:
3
+ FREEZE_AT: 0
4
+ NAME: "build_resnet_backbone"
5
+ WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
6
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
7
+ PIXEL_STD: [58.395, 57.120, 57.375]
8
+ RESNETS:
9
+ DEPTH: 50
10
+ STEM_TYPE: "basic" # not used
11
+ STEM_OUT_CHANNELS: 64
12
+ STRIDE_IN_1X1: False
13
+ OUT_FEATURES: ["res2", "res3", "res4", "res5"]
14
+ # NORM: "SyncBN"
15
+ RES5_MULTI_GRID: [1, 1, 1] # not used
16
+ DATASETS:
17
+ TRAIN: ("ade20k_instance_train",)
18
+ TEST: ("ade20k_instance_val",)
19
+ SOLVER:
20
+ IMS_PER_BATCH: 16
21
+ BASE_LR: 0.0001
22
+ MAX_ITER: 160000
23
+ WARMUP_FACTOR: 1.0
24
+ WARMUP_ITERS: 0
25
+ WEIGHT_DECAY: 0.05
26
+ OPTIMIZER: "ADAMW"
27
+ LR_SCHEDULER_NAME: "WarmupPolyLR"
28
+ BACKBONE_MULTIPLIER: 0.1
29
+ CLIP_GRADIENTS:
30
+ ENABLED: True
31
+ CLIP_TYPE: "full_model"
32
+ CLIP_VALUE: 0.01
33
+ NORM_TYPE: 2.0
34
+ AMP:
35
+ ENABLED: True
36
+ INPUT:
37
+ MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 640) for x in range(5, 21)]"]
38
+ MIN_SIZE_TRAIN_SAMPLING: "choice"
39
+ MIN_SIZE_TEST: 640
40
+ MAX_SIZE_TRAIN: 2560
41
+ MAX_SIZE_TEST: 2560
42
+ CROP:
43
+ ENABLED: True
44
+ TYPE: "absolute"
45
+ SIZE: (640, 640)
46
+ SINGLE_CATEGORY_MAX_AREA: 1.0
47
+ COLOR_AUG_SSD: True
48
+ SIZE_DIVISIBILITY: 640 # used in dataset mapper
49
+ FORMAT: "RGB"
50
+ DATASET_MAPPER_NAME: "mask_former_instance"
51
+ TEST:
52
+ EVAL_PERIOD: 5000
53
+ AUG:
54
+ ENABLED: False
55
+ MIN_SIZES: [320, 480, 640, 800, 960, 1120]
56
+ MAX_SIZE: 4480
57
+ FLIP: True
58
+ DATALOADER:
59
+ FILTER_EMPTY_ANNOTATIONS: True
60
+ NUM_WORKERS: 4
61
+ VERSION: 2
mask2former/configs/ade20k/instance-segmentation/maskformer2_R50_bs16_160k.yaml ADDED
@@ -0,0 +1,44 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: Base-ADE20K-InstanceSegmentation.yaml
2
+ MODEL:
3
+ META_ARCHITECTURE: "MaskFormer"
4
+ SEM_SEG_HEAD:
5
+ NAME: "MaskFormerHead"
6
+ IGNORE_VALUE: 255
7
+ NUM_CLASSES: 100
8
+ LOSS_WEIGHT: 1.0
9
+ CONVS_DIM: 256
10
+ MASK_DIM: 256
11
+ NORM: "GN"
12
+ # pixel decoder
13
+ PIXEL_DECODER_NAME: "MSDeformAttnPixelDecoder"
14
+ IN_FEATURES: ["res2", "res3", "res4", "res5"]
15
+ DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
16
+ COMMON_STRIDE: 4
17
+ TRANSFORMER_ENC_LAYERS: 6
18
+ MASK_FORMER:
19
+ TRANSFORMER_DECODER_NAME: "MultiScaleMaskedTransformerDecoder"
20
+ TRANSFORMER_IN_FEATURE: "multi_scale_pixel_decoder"
21
+ DEEP_SUPERVISION: True
22
+ NO_OBJECT_WEIGHT: 0.1
23
+ CLASS_WEIGHT: 2.0
24
+ MASK_WEIGHT: 5.0
25
+ DICE_WEIGHT: 5.0
26
+ HIDDEN_DIM: 256
27
+ NUM_OBJECT_QUERIES: 100
28
+ NHEADS: 8
29
+ DROPOUT: 0.0
30
+ DIM_FEEDFORWARD: 2048
31
+ ENC_LAYERS: 0
32
+ PRE_NORM: False
33
+ ENFORCE_INPUT_PROJ: False
34
+ SIZE_DIVISIBILITY: 32
35
+ DEC_LAYERS: 10 # 9 decoder layers, add one for the loss on learnable query
36
+ TRAIN_NUM_POINTS: 12544
37
+ OVERSAMPLE_RATIO: 3.0
38
+ IMPORTANCE_SAMPLE_RATIO: 0.75
39
+ TEST:
40
+ SEMANTIC_ON: True
41
+ INSTANCE_ON: True
42
+ PANOPTIC_ON: True
43
+ OVERLAP_THRESHOLD: 0.8
44
+ OBJECT_MASK_THRESHOLD: 0.8
mask2former/configs/ade20k/instance-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_160k.yaml ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_160k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 192
7
+ DEPTHS: [2, 2, 18, 2]
8
+ NUM_HEADS: [6, 12, 24, 48]
9
+ WINDOW_SIZE: 12
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ PRETRAIN_IMG_SIZE: 384
14
+ WEIGHTS: "swin_large_patch4_window12_384_22k.pkl"
15
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
16
+ PIXEL_STD: [58.395, 57.120, 57.375]
17
+ MASK_FORMER:
18
+ NUM_OBJECT_QUERIES: 200
mask2former/configs/ade20k/panoptic-segmentation/Base-ADE20K-PanopticSegmentation.yaml ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ MODEL:
2
+ BACKBONE:
3
+ FREEZE_AT: 0
4
+ NAME: "build_resnet_backbone"
5
+ WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
6
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
7
+ PIXEL_STD: [58.395, 57.120, 57.375]
8
+ RESNETS:
9
+ DEPTH: 50
10
+ STEM_TYPE: "basic" # not used
11
+ STEM_OUT_CHANNELS: 64
12
+ STRIDE_IN_1X1: False
13
+ OUT_FEATURES: ["res2", "res3", "res4", "res5"]
14
+ # NORM: "SyncBN"
15
+ RES5_MULTI_GRID: [1, 1, 1] # not used
16
+ DATASETS:
17
+ TRAIN: ("ade20k_panoptic_train",)
18
+ TEST: ("ade20k_panoptic_val",)
19
+ SOLVER:
20
+ IMS_PER_BATCH: 16
21
+ BASE_LR: 0.0001
22
+ MAX_ITER: 160000
23
+ WARMUP_FACTOR: 1.0
24
+ WARMUP_ITERS: 0
25
+ WEIGHT_DECAY: 0.05
26
+ OPTIMIZER: "ADAMW"
27
+ LR_SCHEDULER_NAME: "WarmupPolyLR"
28
+ BACKBONE_MULTIPLIER: 0.1
29
+ CLIP_GRADIENTS:
30
+ ENABLED: True
31
+ CLIP_TYPE: "full_model"
32
+ CLIP_VALUE: 0.01
33
+ NORM_TYPE: 2.0
34
+ AMP:
35
+ ENABLED: True
36
+ INPUT:
37
+ MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 640) for x in range(5, 21)]"]
38
+ MIN_SIZE_TRAIN_SAMPLING: "choice"
39
+ MIN_SIZE_TEST: 640
40
+ MAX_SIZE_TRAIN: 2560
41
+ MAX_SIZE_TEST: 2560
42
+ CROP:
43
+ ENABLED: True
44
+ TYPE: "absolute"
45
+ SIZE: (640, 640)
46
+ SINGLE_CATEGORY_MAX_AREA: 1.0
47
+ COLOR_AUG_SSD: True
48
+ SIZE_DIVISIBILITY: 640 # used in dataset mapper
49
+ FORMAT: "RGB"
50
+ DATASET_MAPPER_NAME: "mask_former_panoptic"
51
+ TEST:
52
+ EVAL_PERIOD: 5000
53
+ AUG:
54
+ ENABLED: False
55
+ MIN_SIZES: [320, 480, 640, 800, 960, 1120]
56
+ MAX_SIZE: 4480
57
+ FLIP: True
58
+ DATALOADER:
59
+ FILTER_EMPTY_ANNOTATIONS: True
60
+ NUM_WORKERS: 4
61
+ VERSION: 2
mask2former/configs/ade20k/panoptic-segmentation/maskformer2_R50_bs16_160k.yaml ADDED
@@ -0,0 +1,44 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: Base-ADE20K-PanopticSegmentation.yaml
2
+ MODEL:
3
+ META_ARCHITECTURE: "MaskFormer"
4
+ SEM_SEG_HEAD:
5
+ NAME: "MaskFormerHead"
6
+ IGNORE_VALUE: 255
7
+ NUM_CLASSES: 150
8
+ LOSS_WEIGHT: 1.0
9
+ CONVS_DIM: 256
10
+ MASK_DIM: 256
11
+ NORM: "GN"
12
+ # pixel decoder
13
+ PIXEL_DECODER_NAME: "MSDeformAttnPixelDecoder"
14
+ IN_FEATURES: ["res2", "res3", "res4", "res5"]
15
+ DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
16
+ COMMON_STRIDE: 4
17
+ TRANSFORMER_ENC_LAYERS: 6
18
+ MASK_FORMER:
19
+ TRANSFORMER_DECODER_NAME: "MultiScaleMaskedTransformerDecoder"
20
+ TRANSFORMER_IN_FEATURE: "multi_scale_pixel_decoder"
21
+ DEEP_SUPERVISION: True
22
+ NO_OBJECT_WEIGHT: 0.1
23
+ CLASS_WEIGHT: 2.0
24
+ MASK_WEIGHT: 5.0
25
+ DICE_WEIGHT: 5.0
26
+ HIDDEN_DIM: 256
27
+ NUM_OBJECT_QUERIES: 100
28
+ NHEADS: 8
29
+ DROPOUT: 0.0
30
+ DIM_FEEDFORWARD: 2048
31
+ ENC_LAYERS: 0
32
+ PRE_NORM: False
33
+ ENFORCE_INPUT_PROJ: False
34
+ SIZE_DIVISIBILITY: 32
35
+ DEC_LAYERS: 10 # 9 decoder layers, add one for the loss on learnable query
36
+ TRAIN_NUM_POINTS: 12544
37
+ OVERSAMPLE_RATIO: 3.0
38
+ IMPORTANCE_SAMPLE_RATIO: 0.75
39
+ TEST:
40
+ SEMANTIC_ON: True
41
+ INSTANCE_ON: True
42
+ PANOPTIC_ON: True
43
+ OVERLAP_THRESHOLD: 0.8
44
+ OBJECT_MASK_THRESHOLD: 0.8
mask2former/configs/ade20k/panoptic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_160k.yaml ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_160k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 192
7
+ DEPTHS: [2, 2, 18, 2]
8
+ NUM_HEADS: [6, 12, 24, 48]
9
+ WINDOW_SIZE: 12
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ PRETRAIN_IMG_SIZE: 384
14
+ WEIGHTS: "swin_large_patch4_window12_384_22k.pkl"
15
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
16
+ PIXEL_STD: [58.395, 57.120, 57.375]
17
+ MASK_FORMER:
18
+ NUM_OBJECT_QUERIES: 200
mask2former/configs/ade20k/semantic-segmentation/Base-ADE20K-SemanticSegmentation.yaml ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ MODEL:
2
+ BACKBONE:
3
+ FREEZE_AT: 0
4
+ NAME: "build_resnet_backbone"
5
+ WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
6
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
7
+ PIXEL_STD: [58.395, 57.120, 57.375]
8
+ RESNETS:
9
+ DEPTH: 50
10
+ STEM_TYPE: "basic" # not used
11
+ STEM_OUT_CHANNELS: 64
12
+ STRIDE_IN_1X1: False
13
+ OUT_FEATURES: ["res2", "res3", "res4", "res5"]
14
+ # NORM: "SyncBN"
15
+ RES5_MULTI_GRID: [1, 1, 1] # not used
16
+ DATASETS:
17
+ TRAIN: ("ade20k_sem_seg_train",)
18
+ TEST: ("ade20k_sem_seg_val",)
19
+ SOLVER:
20
+ IMS_PER_BATCH: 16
21
+ BASE_LR: 0.0001
22
+ MAX_ITER: 160000
23
+ WARMUP_FACTOR: 1.0
24
+ WARMUP_ITERS: 0
25
+ WEIGHT_DECAY: 0.05
26
+ OPTIMIZER: "ADAMW"
27
+ LR_SCHEDULER_NAME: "WarmupPolyLR"
28
+ BACKBONE_MULTIPLIER: 0.1
29
+ CLIP_GRADIENTS:
30
+ ENABLED: True
31
+ CLIP_TYPE: "full_model"
32
+ CLIP_VALUE: 0.01
33
+ NORM_TYPE: 2.0
34
+ AMP:
35
+ ENABLED: True
36
+ INPUT:
37
+ MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 512) for x in range(5, 21)]"]
38
+ MIN_SIZE_TRAIN_SAMPLING: "choice"
39
+ MIN_SIZE_TEST: 512
40
+ MAX_SIZE_TRAIN: 2048
41
+ MAX_SIZE_TEST: 2048
42
+ CROP:
43
+ ENABLED: True
44
+ TYPE: "absolute"
45
+ SIZE: (512, 512)
46
+ SINGLE_CATEGORY_MAX_AREA: 1.0
47
+ COLOR_AUG_SSD: True
48
+ SIZE_DIVISIBILITY: 512 # used in dataset mapper
49
+ FORMAT: "RGB"
50
+ DATASET_MAPPER_NAME: "mask_former_semantic"
51
+ TEST:
52
+ EVAL_PERIOD: 5000
53
+ AUG:
54
+ ENABLED: False
55
+ MIN_SIZES: [256, 384, 512, 640, 768, 896]
56
+ MAX_SIZE: 3584
57
+ FLIP: True
58
+ DATALOADER:
59
+ FILTER_EMPTY_ANNOTATIONS: True
60
+ NUM_WORKERS: 4
61
+ VERSION: 2
mask2former/configs/ade20k/semantic-segmentation/maskformer2_R101_bs16_90k.yaml ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: maskformer2_R50_bs16_160k.yaml
2
+ MODEL:
3
+ WEIGHTS: "R-101.pkl"
4
+ RESNETS:
5
+ DEPTH: 101
6
+ STEM_TYPE: "basic" # not used
7
+ STEM_OUT_CHANNELS: 64
8
+ STRIDE_IN_1X1: False
9
+ OUT_FEATURES: ["res2", "res3", "res4", "res5"]
10
+ NORM: "SyncBN"
11
+ RES5_MULTI_GRID: [1, 1, 1] # not used
mask2former/configs/ade20k/semantic-segmentation/maskformer2_R50_bs16_160k.yaml ADDED
@@ -0,0 +1,44 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: Base-ADE20K-SemanticSegmentation.yaml
2
+ MODEL:
3
+ META_ARCHITECTURE: "MaskFormer"
4
+ SEM_SEG_HEAD:
5
+ NAME: "MaskFormerHead"
6
+ IGNORE_VALUE: 255
7
+ NUM_CLASSES: 150
8
+ LOSS_WEIGHT: 1.0
9
+ CONVS_DIM: 256
10
+ MASK_DIM: 256
11
+ NORM: "GN"
12
+ # pixel decoder
13
+ PIXEL_DECODER_NAME: "MSDeformAttnPixelDecoder"
14
+ IN_FEATURES: ["res2", "res3", "res4", "res5"]
15
+ DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
16
+ COMMON_STRIDE: 4
17
+ TRANSFORMER_ENC_LAYERS: 6
18
+ MASK_FORMER:
19
+ TRANSFORMER_DECODER_NAME: "MultiScaleMaskedTransformerDecoder"
20
+ TRANSFORMER_IN_FEATURE: "multi_scale_pixel_decoder"
21
+ DEEP_SUPERVISION: True
22
+ NO_OBJECT_WEIGHT: 0.1
23
+ CLASS_WEIGHT: 2.0
24
+ MASK_WEIGHT: 5.0
25
+ DICE_WEIGHT: 5.0
26
+ HIDDEN_DIM: 256
27
+ NUM_OBJECT_QUERIES: 100
28
+ NHEADS: 8
29
+ DROPOUT: 0.0
30
+ DIM_FEEDFORWARD: 2048
31
+ ENC_LAYERS: 0
32
+ PRE_NORM: False
33
+ ENFORCE_INPUT_PROJ: False
34
+ SIZE_DIVISIBILITY: 32
35
+ DEC_LAYERS: 10 # 9 decoder layers, add one for the loss on learnable query
36
+ TRAIN_NUM_POINTS: 12544
37
+ OVERSAMPLE_RATIO: 3.0
38
+ IMPORTANCE_SAMPLE_RATIO: 0.75
39
+ TEST:
40
+ SEMANTIC_ON: True
41
+ INSTANCE_ON: False
42
+ PANOPTIC_ON: False
43
+ OVERLAP_THRESHOLD: 0.8
44
+ OBJECT_MASK_THRESHOLD: 0.8
mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_base_384_bs16_160k_res640.yaml ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_160k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 128
7
+ DEPTHS: [2, 2, 18, 2]
8
+ NUM_HEADS: [4, 8, 16, 32]
9
+ WINDOW_SIZE: 12
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ PRETRAIN_IMG_SIZE: 384
14
+ WEIGHTS: "swin_base_patch4_window12_384.pkl"
15
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
16
+ PIXEL_STD: [58.395, 57.120, 57.375]
17
+ INPUT:
18
+ MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 640) for x in range(5, 21)]"]
19
+ MIN_SIZE_TRAIN_SAMPLING: "choice"
20
+ MIN_SIZE_TEST: 640
21
+ MAX_SIZE_TRAIN: 2560
22
+ MAX_SIZE_TEST: 2560
23
+ CROP:
24
+ ENABLED: True
25
+ TYPE: "absolute"
26
+ SIZE: (640, 640)
27
+ SINGLE_CATEGORY_MAX_AREA: 1.0
28
+ COLOR_AUG_SSD: True
29
+ SIZE_DIVISIBILITY: 640 # used in dataset mapper
30
+ FORMAT: "RGB"
31
+ TEST:
32
+ EVAL_PERIOD: 5000
33
+ AUG:
34
+ ENABLED: False
35
+ MIN_SIZES: [320, 480, 640, 800, 960, 1120]
36
+ MAX_SIZE: 4480
37
+ FLIP: True
mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_base_IN21k_384_bs16_160k_res640.yaml ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_160k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 128
7
+ DEPTHS: [2, 2, 18, 2]
8
+ NUM_HEADS: [4, 8, 16, 32]
9
+ WINDOW_SIZE: 12
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ PRETRAIN_IMG_SIZE: 384
14
+ WEIGHTS: "swin_base_patch4_window12_384_22k.pkl"
15
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
16
+ PIXEL_STD: [58.395, 57.120, 57.375]
17
+ INPUT:
18
+ MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 640) for x in range(5, 21)]"]
19
+ MIN_SIZE_TRAIN_SAMPLING: "choice"
20
+ MIN_SIZE_TEST: 640
21
+ MAX_SIZE_TRAIN: 2560
22
+ MAX_SIZE_TEST: 2560
23
+ CROP:
24
+ ENABLED: True
25
+ TYPE: "absolute"
26
+ SIZE: (640, 640)
27
+ SINGLE_CATEGORY_MAX_AREA: 1.0
28
+ COLOR_AUG_SSD: True
29
+ SIZE_DIVISIBILITY: 640 # used in dataset mapper
30
+ FORMAT: "RGB"
31
+ TEST:
32
+ EVAL_PERIOD: 5000
33
+ AUG:
34
+ ENABLED: False
35
+ MIN_SIZES: [320, 480, 640, 800, 960, 1120]
36
+ MAX_SIZE: 4480
37
+ FLIP: True
mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_160k_res640.yaml ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_160k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 192
7
+ DEPTHS: [2, 2, 18, 2]
8
+ NUM_HEADS: [6, 12, 24, 48]
9
+ WINDOW_SIZE: 12
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ PRETRAIN_IMG_SIZE: 384
14
+ WEIGHTS: "swin_large_patch4_window12_384_22k.pkl"
15
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
16
+ PIXEL_STD: [58.395, 57.120, 57.375]
17
+ INPUT:
18
+ MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 640) for x in range(5, 21)]"]
19
+ MIN_SIZE_TRAIN_SAMPLING: "choice"
20
+ MIN_SIZE_TEST: 640
21
+ MAX_SIZE_TRAIN: 2560
22
+ MAX_SIZE_TEST: 2560
23
+ CROP:
24
+ ENABLED: True
25
+ TYPE: "absolute"
26
+ SIZE: (640, 640)
27
+ SINGLE_CATEGORY_MAX_AREA: 1.0
28
+ COLOR_AUG_SSD: True
29
+ SIZE_DIVISIBILITY: 640 # used in dataset mapper
30
+ FORMAT: "RGB"
31
+ TEST:
32
+ EVAL_PERIOD: 5000
33
+ AUG:
34
+ ENABLED: False
35
+ MIN_SIZES: [320, 480, 640, 800, 960, 1120]
36
+ MAX_SIZE: 4480
37
+ FLIP: True
mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_small_bs16_160k.yaml ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_160k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 96
7
+ DEPTHS: [2, 2, 18, 2]
8
+ NUM_HEADS: [3, 6, 12, 24]
9
+ WINDOW_SIZE: 7
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ WEIGHTS: "swin_small_patch4_window7_224.pkl"
14
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
15
+ PIXEL_STD: [58.395, 57.120, 57.375]
mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_tiny_bs16_160k.yaml ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_160k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 96
7
+ DEPTHS: [2, 2, 6, 2]
8
+ NUM_HEADS: [3, 6, 12, 24]
9
+ WINDOW_SIZE: 7
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ WEIGHTS: "swin_tiny_patch4_window7_224.pkl"
14
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
15
+ PIXEL_STD: [58.395, 57.120, 57.375]
mask2former/configs/cityscapes/instance-segmentation/Base-Cityscapes-InstanceSegmentation.yaml ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ MODEL:
2
+ BACKBONE:
3
+ FREEZE_AT: 0
4
+ NAME: "build_resnet_backbone"
5
+ WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
6
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
7
+ PIXEL_STD: [58.395, 57.120, 57.375]
8
+ RESNETS:
9
+ DEPTH: 50
10
+ STEM_TYPE: "basic" # not used
11
+ STEM_OUT_CHANNELS: 64
12
+ STRIDE_IN_1X1: False
13
+ OUT_FEATURES: ["res2", "res3", "res4", "res5"]
14
+ NORM: "SyncBN" # use syncbn for cityscapes dataset
15
+ RES5_MULTI_GRID: [1, 1, 1] # not used
16
+ DATASETS:
17
+ TRAIN: ("cityscapes_fine_instance_seg_train",)
18
+ TEST: ("cityscapes_fine_instance_seg_val",)
19
+ SOLVER:
20
+ IMS_PER_BATCH: 16
21
+ BASE_LR: 0.0001
22
+ MAX_ITER: 90000
23
+ WARMUP_FACTOR: 1.0
24
+ WARMUP_ITERS: 0
25
+ WEIGHT_DECAY: 0.05
26
+ OPTIMIZER: "ADAMW"
27
+ LR_SCHEDULER_NAME: "WarmupPolyLR"
28
+ BACKBONE_MULTIPLIER: 0.1
29
+ CLIP_GRADIENTS:
30
+ ENABLED: True
31
+ CLIP_TYPE: "full_model"
32
+ CLIP_VALUE: 0.01
33
+ NORM_TYPE: 2.0
34
+ AMP:
35
+ ENABLED: True
36
+ INPUT:
37
+ MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 1024) for x in range(5, 21)]"]
38
+ MIN_SIZE_TRAIN_SAMPLING: "choice"
39
+ MIN_SIZE_TEST: 1024
40
+ MAX_SIZE_TRAIN: 4096
41
+ MAX_SIZE_TEST: 2048
42
+ CROP:
43
+ ENABLED: True
44
+ TYPE: "absolute"
45
+ SIZE: (512, 1024)
46
+ SINGLE_CATEGORY_MAX_AREA: 1.0
47
+ COLOR_AUG_SSD: True
48
+ SIZE_DIVISIBILITY: -1
49
+ FORMAT: "RGB"
50
+ DATASET_MAPPER_NAME: "mask_former_instance"
51
+ TEST:
52
+ EVAL_PERIOD: 5000
53
+ AUG:
54
+ ENABLED: False
55
+ MIN_SIZES: [512, 768, 1024, 1280, 1536, 1792]
56
+ MAX_SIZE: 4096
57
+ FLIP: True
58
+ DATALOADER:
59
+ FILTER_EMPTY_ANNOTATIONS: True
60
+ NUM_WORKERS: 4
61
+ VERSION: 2
mask2former/configs/cityscapes/instance-segmentation/maskformer2_R101_bs16_90k.yaml ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: maskformer2_R50_bs16_90k.yaml
2
+ MODEL:
3
+ WEIGHTS: "R-101.pkl"
4
+ RESNETS:
5
+ DEPTH: 101
6
+ STEM_TYPE: "basic" # not used
7
+ STEM_OUT_CHANNELS: 64
8
+ STRIDE_IN_1X1: False
9
+ OUT_FEATURES: ["res2", "res3", "res4", "res5"]
10
+ NORM: "SyncBN"
11
+ RES5_MULTI_GRID: [1, 1, 1] # not used
mask2former/configs/cityscapes/instance-segmentation/maskformer2_R50_bs16_90k.yaml ADDED
@@ -0,0 +1,44 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: Base-Cityscapes-InstanceSegmentation.yaml
2
+ MODEL:
3
+ META_ARCHITECTURE: "MaskFormer"
4
+ SEM_SEG_HEAD:
5
+ NAME: "MaskFormerHead"
6
+ IGNORE_VALUE: 255
7
+ NUM_CLASSES: 8
8
+ LOSS_WEIGHT: 1.0
9
+ CONVS_DIM: 256
10
+ MASK_DIM: 256
11
+ NORM: "GN"
12
+ # pixel decoder
13
+ PIXEL_DECODER_NAME: "MSDeformAttnPixelDecoder"
14
+ IN_FEATURES: ["res2", "res3", "res4", "res5"]
15
+ DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
16
+ COMMON_STRIDE: 4
17
+ TRANSFORMER_ENC_LAYERS: 6
18
+ MASK_FORMER:
19
+ TRANSFORMER_DECODER_NAME: "MultiScaleMaskedTransformerDecoder"
20
+ TRANSFORMER_IN_FEATURE: "multi_scale_pixel_decoder"
21
+ DEEP_SUPERVISION: True
22
+ NO_OBJECT_WEIGHT: 0.1
23
+ CLASS_WEIGHT: 2.0
24
+ MASK_WEIGHT: 5.0
25
+ DICE_WEIGHT: 5.0
26
+ HIDDEN_DIM: 256
27
+ NUM_OBJECT_QUERIES: 100
28
+ NHEADS: 8
29
+ DROPOUT: 0.0
30
+ DIM_FEEDFORWARD: 2048
31
+ ENC_LAYERS: 0
32
+ PRE_NORM: False
33
+ ENFORCE_INPUT_PROJ: False
34
+ SIZE_DIVISIBILITY: 32
35
+ DEC_LAYERS: 10 # 9 decoder layers, add one for the loss on learnable query
36
+ TRAIN_NUM_POINTS: 12544
37
+ OVERSAMPLE_RATIO: 3.0
38
+ IMPORTANCE_SAMPLE_RATIO: 0.75
39
+ TEST:
40
+ SEMANTIC_ON: False
41
+ INSTANCE_ON: True
42
+ PANOPTIC_ON: False
43
+ OVERLAP_THRESHOLD: 0.8
44
+ OBJECT_MASK_THRESHOLD: 0.8
mask2former/configs/cityscapes/instance-segmentation/swin/maskformer2_swin_base_IN21k_384_bs16_90k.yaml ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_90k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 128
7
+ DEPTHS: [2, 2, 18, 2]
8
+ NUM_HEADS: [4, 8, 16, 32]
9
+ WINDOW_SIZE: 12
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ PRETRAIN_IMG_SIZE: 384
14
+ WEIGHTS: "swin_base_patch4_window12_384_22k.pkl"
15
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
16
+ PIXEL_STD: [58.395, 57.120, 57.375]
mask2former/configs/cityscapes/instance-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_90k.yaml ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_90k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 192
7
+ DEPTHS: [2, 2, 18, 2]
8
+ NUM_HEADS: [6, 12, 24, 48]
9
+ WINDOW_SIZE: 12
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ PRETRAIN_IMG_SIZE: 384
14
+ WEIGHTS: "swin_large_patch4_window12_384_22k.pkl"
15
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
16
+ PIXEL_STD: [58.395, 57.120, 57.375]
17
+ MASK_FORMER:
18
+ NUM_OBJECT_QUERIES: 200
mask2former/configs/cityscapes/instance-segmentation/swin/maskformer2_swin_small_bs16_90k.yaml ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_90k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 96
7
+ DEPTHS: [2, 2, 18, 2]
8
+ NUM_HEADS: [3, 6, 12, 24]
9
+ WINDOW_SIZE: 7
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ WEIGHTS: "swin_small_patch4_window7_224.pkl"
14
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
15
+ PIXEL_STD: [58.395, 57.120, 57.375]
mask2former/configs/cityscapes/instance-segmentation/swin/maskformer2_swin_tiny_bs16_90k.yaml ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_90k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 96
7
+ DEPTHS: [2, 2, 6, 2]
8
+ NUM_HEADS: [3, 6, 12, 24]
9
+ WINDOW_SIZE: 7
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ WEIGHTS: "swin_tiny_patch4_window7_224.pkl"
14
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
15
+ PIXEL_STD: [58.395, 57.120, 57.375]
mask2former/configs/cityscapes/panoptic-segmentation/Base-Cityscapes-PanopticSegmentation.yaml ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ MODEL:
2
+ BACKBONE:
3
+ FREEZE_AT: 0
4
+ NAME: "build_resnet_backbone"
5
+ WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
6
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
7
+ PIXEL_STD: [58.395, 57.120, 57.375]
8
+ RESNETS:
9
+ DEPTH: 50
10
+ STEM_TYPE: "basic" # not used
11
+ STEM_OUT_CHANNELS: 64
12
+ STRIDE_IN_1X1: False
13
+ OUT_FEATURES: ["res2", "res3", "res4", "res5"]
14
+ NORM: "SyncBN" # use syncbn for cityscapes dataset
15
+ RES5_MULTI_GRID: [1, 1, 1] # not used
16
+ DATASETS:
17
+ TRAIN: ("cityscapes_fine_panoptic_train",)
18
+ TEST: ("cityscapes_fine_panoptic_val",)
19
+ SOLVER:
20
+ IMS_PER_BATCH: 16
21
+ BASE_LR: 0.0001
22
+ MAX_ITER: 90000
23
+ WARMUP_FACTOR: 1.0
24
+ WARMUP_ITERS: 0
25
+ WEIGHT_DECAY: 0.05
26
+ OPTIMIZER: "ADAMW"
27
+ LR_SCHEDULER_NAME: "WarmupPolyLR"
28
+ BACKBONE_MULTIPLIER: 0.1
29
+ CLIP_GRADIENTS:
30
+ ENABLED: True
31
+ CLIP_TYPE: "full_model"
32
+ CLIP_VALUE: 0.01
33
+ NORM_TYPE: 2.0
34
+ AMP:
35
+ ENABLED: True
36
+ INPUT:
37
+ MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 1024) for x in range(5, 21)]"]
38
+ MIN_SIZE_TRAIN_SAMPLING: "choice"
39
+ MIN_SIZE_TEST: 1024
40
+ MAX_SIZE_TRAIN: 4096
41
+ MAX_SIZE_TEST: 2048
42
+ CROP:
43
+ ENABLED: True
44
+ TYPE: "absolute"
45
+ SIZE: (512, 1024)
46
+ SINGLE_CATEGORY_MAX_AREA: 1.0
47
+ COLOR_AUG_SSD: True
48
+ SIZE_DIVISIBILITY: -1
49
+ FORMAT: "RGB"
50
+ DATASET_MAPPER_NAME: "mask_former_panoptic"
51
+ TEST:
52
+ EVAL_PERIOD: 5000
53
+ AUG:
54
+ ENABLED: False
55
+ MIN_SIZES: [512, 768, 1024, 1280, 1536, 1792]
56
+ MAX_SIZE: 4096
57
+ FLIP: True
58
+ DATALOADER:
59
+ FILTER_EMPTY_ANNOTATIONS: True
60
+ NUM_WORKERS: 4
61
+ VERSION: 2
mask2former/configs/cityscapes/panoptic-segmentation/maskformer2_R101_bs16_90k.yaml ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: maskformer2_R50_bs16_90k.yaml
2
+ MODEL:
3
+ WEIGHTS: "R-101.pkl"
4
+ RESNETS:
5
+ DEPTH: 101
6
+ STEM_TYPE: "basic" # not used
7
+ STEM_OUT_CHANNELS: 64
8
+ STRIDE_IN_1X1: False
9
+ OUT_FEATURES: ["res2", "res3", "res4", "res5"]
10
+ NORM: "SyncBN"
11
+ RES5_MULTI_GRID: [1, 1, 1] # not used
mask2former/configs/cityscapes/panoptic-segmentation/maskformer2_R50_bs16_90k.yaml ADDED
@@ -0,0 +1,44 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: Base-Cityscapes-PanopticSegmentation.yaml
2
+ MODEL:
3
+ META_ARCHITECTURE: "MaskFormer"
4
+ SEM_SEG_HEAD:
5
+ NAME: "MaskFormerHead"
6
+ IGNORE_VALUE: 255
7
+ NUM_CLASSES: 19
8
+ LOSS_WEIGHT: 1.0
9
+ CONVS_DIM: 256
10
+ MASK_DIM: 256
11
+ NORM: "GN"
12
+ # pixel decoder
13
+ PIXEL_DECODER_NAME: "MSDeformAttnPixelDecoder"
14
+ IN_FEATURES: ["res2", "res3", "res4", "res5"]
15
+ DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
16
+ COMMON_STRIDE: 4
17
+ TRANSFORMER_ENC_LAYERS: 6
18
+ MASK_FORMER:
19
+ TRANSFORMER_DECODER_NAME: "MultiScaleMaskedTransformerDecoder"
20
+ TRANSFORMER_IN_FEATURE: "multi_scale_pixel_decoder"
21
+ DEEP_SUPERVISION: True
22
+ NO_OBJECT_WEIGHT: 0.1
23
+ CLASS_WEIGHT: 2.0
24
+ MASK_WEIGHT: 5.0
25
+ DICE_WEIGHT: 5.0
26
+ HIDDEN_DIM: 256
27
+ NUM_OBJECT_QUERIES: 100
28
+ NHEADS: 8
29
+ DROPOUT: 0.0
30
+ DIM_FEEDFORWARD: 2048
31
+ ENC_LAYERS: 0
32
+ PRE_NORM: False
33
+ ENFORCE_INPUT_PROJ: False
34
+ SIZE_DIVISIBILITY: 32
35
+ DEC_LAYERS: 10 # 9 decoder layers, add one for the loss on learnable query
36
+ TRAIN_NUM_POINTS: 12544
37
+ OVERSAMPLE_RATIO: 3.0
38
+ IMPORTANCE_SAMPLE_RATIO: 0.75
39
+ TEST:
40
+ SEMANTIC_ON: True
41
+ INSTANCE_ON: True
42
+ PANOPTIC_ON: True
43
+ OVERLAP_THRESHOLD: 0.8
44
+ OBJECT_MASK_THRESHOLD: 0.8
mask2former/configs/cityscapes/panoptic-segmentation/swin/maskformer2_swin_base_IN21k_384_bs16_90k.yaml ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_90k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 128
7
+ DEPTHS: [2, 2, 18, 2]
8
+ NUM_HEADS: [4, 8, 16, 32]
9
+ WINDOW_SIZE: 12
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ PRETRAIN_IMG_SIZE: 384
14
+ WEIGHTS: "swin_base_patch4_window12_384_22k.pkl"
15
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
16
+ PIXEL_STD: [58.395, 57.120, 57.375]
mask2former/configs/cityscapes/panoptic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_90k.yaml ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_90k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 192
7
+ DEPTHS: [2, 2, 18, 2]
8
+ NUM_HEADS: [6, 12, 24, 48]
9
+ WINDOW_SIZE: 12
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ PRETRAIN_IMG_SIZE: 384
14
+ WEIGHTS: "swin_large_patch4_window12_384_22k.pkl"
15
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
16
+ PIXEL_STD: [58.395, 57.120, 57.375]
17
+ MASK_FORMER:
18
+ NUM_OBJECT_QUERIES: 200
mask2former/configs/cityscapes/panoptic-segmentation/swin/maskformer2_swin_small_bs16_90k.yaml ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_90k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 96
7
+ DEPTHS: [2, 2, 18, 2]
8
+ NUM_HEADS: [3, 6, 12, 24]
9
+ WINDOW_SIZE: 7
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ WEIGHTS: "swin_small_patch4_window7_224.pkl"
14
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
15
+ PIXEL_STD: [58.395, 57.120, 57.375]
mask2former/configs/cityscapes/panoptic-segmentation/swin/maskformer2_swin_tiny_bs16_90k.yaml ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_90k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 96
7
+ DEPTHS: [2, 2, 6, 2]
8
+ NUM_HEADS: [3, 6, 12, 24]
9
+ WINDOW_SIZE: 7
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ WEIGHTS: "swin_tiny_patch4_window7_224.pkl"
14
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
15
+ PIXEL_STD: [58.395, 57.120, 57.375]
mask2former/configs/cityscapes/semantic-segmentation/Base-Cityscapes-SemanticSegmentation.yaml ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ MODEL:
2
+ BACKBONE:
3
+ FREEZE_AT: 0
4
+ NAME: "build_resnet_backbone"
5
+ WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
6
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
7
+ PIXEL_STD: [58.395, 57.120, 57.375]
8
+ RESNETS:
9
+ DEPTH: 50
10
+ STEM_TYPE: "basic" # not used
11
+ STEM_OUT_CHANNELS: 64
12
+ STRIDE_IN_1X1: False
13
+ OUT_FEATURES: ["res2", "res3", "res4", "res5"]
14
+ NORM: "SyncBN" # use syncbn for cityscapes dataset
15
+ RES5_MULTI_GRID: [1, 1, 1] # not used
16
+ DATASETS:
17
+ TRAIN: ("cityscapes_fine_sem_seg_train",)
18
+ TEST: ("cityscapes_fine_sem_seg_val",)
19
+ SOLVER:
20
+ IMS_PER_BATCH: 16
21
+ BASE_LR: 0.0001
22
+ MAX_ITER: 90000
23
+ WARMUP_FACTOR: 1.0
24
+ WARMUP_ITERS: 0
25
+ WEIGHT_DECAY: 0.05
26
+ OPTIMIZER: "ADAMW"
27
+ LR_SCHEDULER_NAME: "WarmupPolyLR"
28
+ BACKBONE_MULTIPLIER: 0.1
29
+ CLIP_GRADIENTS:
30
+ ENABLED: True
31
+ CLIP_TYPE: "full_model"
32
+ CLIP_VALUE: 0.01
33
+ NORM_TYPE: 2.0
34
+ AMP:
35
+ ENABLED: True
36
+ INPUT:
37
+ MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 1024) for x in range(5, 21)]"]
38
+ MIN_SIZE_TRAIN_SAMPLING: "choice"
39
+ MIN_SIZE_TEST: 1024
40
+ MAX_SIZE_TRAIN: 4096
41
+ MAX_SIZE_TEST: 2048
42
+ CROP:
43
+ ENABLED: True
44
+ TYPE: "absolute"
45
+ SIZE: (512, 1024)
46
+ SINGLE_CATEGORY_MAX_AREA: 1.0
47
+ COLOR_AUG_SSD: True
48
+ SIZE_DIVISIBILITY: -1
49
+ FORMAT: "RGB"
50
+ DATASET_MAPPER_NAME: "mask_former_semantic"
51
+ TEST:
52
+ EVAL_PERIOD: 5000
53
+ AUG:
54
+ ENABLED: False
55
+ MIN_SIZES: [512, 768, 1024, 1280, 1536, 1792]
56
+ MAX_SIZE: 4096
57
+ FLIP: True
58
+ DATALOADER:
59
+ FILTER_EMPTY_ANNOTATIONS: True
60
+ NUM_WORKERS: 4
61
+ VERSION: 2
mask2former/configs/cityscapes/semantic-segmentation/maskformer2_R101_bs16_90k.yaml ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: maskformer2_R50_bs16_90k.yaml
2
+ MODEL:
3
+ WEIGHTS: "R-101.pkl"
4
+ RESNETS:
5
+ DEPTH: 101
6
+ STEM_TYPE: "basic" # not used
7
+ STEM_OUT_CHANNELS: 64
8
+ STRIDE_IN_1X1: False
9
+ OUT_FEATURES: ["res2", "res3", "res4", "res5"]
10
+ NORM: "SyncBN"
11
+ RES5_MULTI_GRID: [1, 1, 1] # not used
mask2former/configs/cityscapes/semantic-segmentation/maskformer2_R50_bs16_90k.yaml ADDED
@@ -0,0 +1,44 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: Base-Cityscapes-SemanticSegmentation.yaml
2
+ MODEL:
3
+ META_ARCHITECTURE: "MaskFormer"
4
+ SEM_SEG_HEAD:
5
+ NAME: "MaskFormerHead"
6
+ IGNORE_VALUE: 255
7
+ NUM_CLASSES: 19
8
+ LOSS_WEIGHT: 1.0
9
+ CONVS_DIM: 256
10
+ MASK_DIM: 256
11
+ NORM: "GN"
12
+ # pixel decoder
13
+ PIXEL_DECODER_NAME: "MSDeformAttnPixelDecoder"
14
+ IN_FEATURES: ["res2", "res3", "res4", "res5"]
15
+ DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
16
+ COMMON_STRIDE: 4
17
+ TRANSFORMER_ENC_LAYERS: 6
18
+ MASK_FORMER:
19
+ TRANSFORMER_DECODER_NAME: "MultiScaleMaskedTransformerDecoder"
20
+ TRANSFORMER_IN_FEATURE: "multi_scale_pixel_decoder"
21
+ DEEP_SUPERVISION: True
22
+ NO_OBJECT_WEIGHT: 0.1
23
+ CLASS_WEIGHT: 2.0
24
+ MASK_WEIGHT: 5.0
25
+ DICE_WEIGHT: 5.0
26
+ HIDDEN_DIM: 256
27
+ NUM_OBJECT_QUERIES: 100
28
+ NHEADS: 8
29
+ DROPOUT: 0.0
30
+ DIM_FEEDFORWARD: 2048
31
+ ENC_LAYERS: 0
32
+ PRE_NORM: False
33
+ ENFORCE_INPUT_PROJ: False
34
+ SIZE_DIVISIBILITY: 32
35
+ DEC_LAYERS: 10 # 9 decoder layers, add one for the loss on learnable query
36
+ TRAIN_NUM_POINTS: 12544
37
+ OVERSAMPLE_RATIO: 3.0
38
+ IMPORTANCE_SAMPLE_RATIO: 0.75
39
+ TEST:
40
+ SEMANTIC_ON: True
41
+ INSTANCE_ON: False
42
+ PANOPTIC_ON: False
43
+ OVERLAP_THRESHOLD: 0.8
44
+ OBJECT_MASK_THRESHOLD: 0.8
mask2former/configs/cityscapes/semantic-segmentation/swin/maskformer2_swin_base_IN21k_384_bs16_90k.yaml ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_90k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 128
7
+ DEPTHS: [2, 2, 18, 2]
8
+ NUM_HEADS: [4, 8, 16, 32]
9
+ WINDOW_SIZE: 12
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ PRETRAIN_IMG_SIZE: 384
14
+ WEIGHTS: "swin_base_patch4_window12_384_22k.pkl"
15
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
16
+ PIXEL_STD: [58.395, 57.120, 57.375]
mask2former/configs/cityscapes/semantic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_90k.yaml ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_90k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 192
7
+ DEPTHS: [2, 2, 18, 2]
8
+ NUM_HEADS: [6, 12, 24, 48]
9
+ WINDOW_SIZE: 12
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ PRETRAIN_IMG_SIZE: 384
14
+ WEIGHTS: "swin_large_patch4_window12_384_22k.pkl"
15
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
16
+ PIXEL_STD: [58.395, 57.120, 57.375]
17
+ MASK_FORMER:
18
+ NUM_OBJECT_QUERIES: 100
mask2former/configs/cityscapes/semantic-segmentation/swin/maskformer2_swin_small_bs16_90k.yaml ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_90k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 96
7
+ DEPTHS: [2, 2, 18, 2]
8
+ NUM_HEADS: [3, 6, 12, 24]
9
+ WINDOW_SIZE: 7
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ WEIGHTS: "swin_small_patch4_window7_224.pkl"
14
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
15
+ PIXEL_STD: [58.395, 57.120, 57.375]
mask2former/configs/cityscapes/semantic-segmentation/swin/maskformer2_swin_tiny_bs16_90k.yaml ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: ../maskformer2_R50_bs16_90k.yaml
2
+ MODEL:
3
+ BACKBONE:
4
+ NAME: "D2SwinTransformer"
5
+ SWIN:
6
+ EMBED_DIM: 96
7
+ DEPTHS: [2, 2, 6, 2]
8
+ NUM_HEADS: [3, 6, 12, 24]
9
+ WINDOW_SIZE: 7
10
+ APE: False
11
+ DROP_PATH_RATE: 0.3
12
+ PATCH_NORM: True
13
+ WEIGHTS: "swin_tiny_patch4_window7_224.pkl"
14
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
15
+ PIXEL_STD: [58.395, 57.120, 57.375]
mask2former/configs/coco/instance-segmentation/Base-COCO-InstanceSegmentation.yaml ADDED
@@ -0,0 +1,47 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ MODEL:
2
+ BACKBONE:
3
+ FREEZE_AT: 0
4
+ NAME: "build_resnet_backbone"
5
+ WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
6
+ PIXEL_MEAN: [123.675, 116.280, 103.530]
7
+ PIXEL_STD: [58.395, 57.120, 57.375]
8
+ RESNETS:
9
+ DEPTH: 50
10
+ STEM_TYPE: "basic" # not used
11
+ STEM_OUT_CHANNELS: 64
12
+ STRIDE_IN_1X1: False
13
+ OUT_FEATURES: ["res2", "res3", "res4", "res5"]
14
+ # NORM: "SyncBN"
15
+ RES5_MULTI_GRID: [1, 1, 1] # not used
16
+ DATASETS:
17
+ TRAIN: ("coco_2017_train",)
18
+ TEST: ("coco_2017_val",)
19
+ SOLVER:
20
+ IMS_PER_BATCH: 16
21
+ BASE_LR: 0.0001
22
+ STEPS: (327778, 355092)
23
+ MAX_ITER: 368750
24
+ WARMUP_FACTOR: 1.0
25
+ WARMUP_ITERS: 10
26
+ WEIGHT_DECAY: 0.05
27
+ OPTIMIZER: "ADAMW"
28
+ BACKBONE_MULTIPLIER: 0.1
29
+ CLIP_GRADIENTS:
30
+ ENABLED: True
31
+ CLIP_TYPE: "full_model"
32
+ CLIP_VALUE: 0.01
33
+ NORM_TYPE: 2.0
34
+ AMP:
35
+ ENABLED: True
36
+ INPUT:
37
+ IMAGE_SIZE: 1024
38
+ MIN_SCALE: 0.1
39
+ MAX_SCALE: 2.0
40
+ FORMAT: "RGB"
41
+ DATASET_MAPPER_NAME: "coco_instance_lsj"
42
+ TEST:
43
+ EVAL_PERIOD: 5000
44
+ DATALOADER:
45
+ FILTER_EMPTY_ANNOTATIONS: True
46
+ NUM_WORKERS: 4
47
+ VERSION: 2
mask2former/configs/coco/instance-segmentation/maskformer2_R101_bs16_50ep.yaml ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ _BASE_: maskformer2_R50_bs16_50ep.yaml
2
+ MODEL:
3
+ WEIGHTS: "R-101.pkl"
4
+ RESNETS:
5
+ DEPTH: 101
6
+ STEM_TYPE: "basic" # not used
7
+ STEM_OUT_CHANNELS: 64
8
+ STRIDE_IN_1X1: False
9
+ OUT_FEATURES: ["res2", "res3", "res4", "res5"]
10
+ # NORM: "SyncBN"
11
+ RES5_MULTI_GRID: [1, 1, 1] # not used