Commit 3b49518 · committed by Bachmann Roman Christian
Parent(s): bb10473
Initial commit
This view is limited to 50 files because it contains too many changes. See the raw diff for the full changeset.
- .gitignore +1 -0
- FINETUNING.md +126 -0
- LICENSE +399 -0
- app.py +405 -0
- dpt/__init__.py +0 -0
- dpt/base_model.py +16 -0
- dpt/blocks.py +383 -0
- dpt/midas_net.py +77 -0
- dpt/models.py +153 -0
- dpt/transforms.py +231 -0
- dpt/vit.py +576 -0
- mask2former/__init__.py +26 -0
- mask2former/config.py +114 -0
- mask2former/configs/ade20k/instance-segmentation/Base-ADE20K-InstanceSegmentation.yaml +61 -0
- mask2former/configs/ade20k/instance-segmentation/maskformer2_R50_bs16_160k.yaml +44 -0
- mask2former/configs/ade20k/instance-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_160k.yaml +18 -0
- mask2former/configs/ade20k/panoptic-segmentation/Base-ADE20K-PanopticSegmentation.yaml +61 -0
- mask2former/configs/ade20k/panoptic-segmentation/maskformer2_R50_bs16_160k.yaml +44 -0
- mask2former/configs/ade20k/panoptic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_160k.yaml +18 -0
- mask2former/configs/ade20k/semantic-segmentation/Base-ADE20K-SemanticSegmentation.yaml +61 -0
- mask2former/configs/ade20k/semantic-segmentation/maskformer2_R101_bs16_90k.yaml +11 -0
- mask2former/configs/ade20k/semantic-segmentation/maskformer2_R50_bs16_160k.yaml +44 -0
- mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_base_384_bs16_160k_res640.yaml +37 -0
- mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_base_IN21k_384_bs16_160k_res640.yaml +37 -0
- mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_160k_res640.yaml +37 -0
- mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_small_bs16_160k.yaml +15 -0
- mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_tiny_bs16_160k.yaml +15 -0
- mask2former/configs/cityscapes/instance-segmentation/Base-Cityscapes-InstanceSegmentation.yaml +61 -0
- mask2former/configs/cityscapes/instance-segmentation/maskformer2_R101_bs16_90k.yaml +11 -0
- mask2former/configs/cityscapes/instance-segmentation/maskformer2_R50_bs16_90k.yaml +44 -0
- mask2former/configs/cityscapes/instance-segmentation/swin/maskformer2_swin_base_IN21k_384_bs16_90k.yaml +16 -0
- mask2former/configs/cityscapes/instance-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_90k.yaml +18 -0
- mask2former/configs/cityscapes/instance-segmentation/swin/maskformer2_swin_small_bs16_90k.yaml +15 -0
- mask2former/configs/cityscapes/instance-segmentation/swin/maskformer2_swin_tiny_bs16_90k.yaml +15 -0
- mask2former/configs/cityscapes/panoptic-segmentation/Base-Cityscapes-PanopticSegmentation.yaml +61 -0
- mask2former/configs/cityscapes/panoptic-segmentation/maskformer2_R101_bs16_90k.yaml +11 -0
- mask2former/configs/cityscapes/panoptic-segmentation/maskformer2_R50_bs16_90k.yaml +44 -0
- mask2former/configs/cityscapes/panoptic-segmentation/swin/maskformer2_swin_base_IN21k_384_bs16_90k.yaml +16 -0
- mask2former/configs/cityscapes/panoptic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_90k.yaml +18 -0
- mask2former/configs/cityscapes/panoptic-segmentation/swin/maskformer2_swin_small_bs16_90k.yaml +15 -0
- mask2former/configs/cityscapes/panoptic-segmentation/swin/maskformer2_swin_tiny_bs16_90k.yaml +15 -0
- mask2former/configs/cityscapes/semantic-segmentation/Base-Cityscapes-SemanticSegmentation.yaml +61 -0
- mask2former/configs/cityscapes/semantic-segmentation/maskformer2_R101_bs16_90k.yaml +11 -0
- mask2former/configs/cityscapes/semantic-segmentation/maskformer2_R50_bs16_90k.yaml +44 -0
- mask2former/configs/cityscapes/semantic-segmentation/swin/maskformer2_swin_base_IN21k_384_bs16_90k.yaml +16 -0
- mask2former/configs/cityscapes/semantic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_90k.yaml +18 -0
- mask2former/configs/cityscapes/semantic-segmentation/swin/maskformer2_swin_small_bs16_90k.yaml +15 -0
- mask2former/configs/cityscapes/semantic-segmentation/swin/maskformer2_swin_tiny_bs16_90k.yaml +15 -0
- mask2former/configs/coco/instance-segmentation/Base-COCO-InstanceSegmentation.yaml +47 -0
- mask2former/configs/coco/instance-segmentation/maskformer2_R101_bs16_50ep.yaml +11 -0
.gitignore
ADDED
@@ -0,0 +1 @@
.DS_Store
FINETUNING.md
ADDED
@@ -0,0 +1,126 @@
# Fine-tuning

We provide fine-tuning scripts for classification, semantic segmentation, depth estimation and more.
Please check [SETUP.md](SETUP.md) for set-up instructions first.

- [General information](#general-information)
- [Classification](#classification)
- [Semantic segmentation](#semantic-segmentation)
- [Depth estimation](#depth-estimation)
- [Taskonomy tasks](#taskonomy-tasks)

## General information

### Loading pre-trained models

All our fine-tuning scripts support models in the MultiMAE / MultiViT format. Pre-trained models using the timm / ViT format can be converted to this format using the [`vit2multimae_converter.py`](tools/vit2multimae_converter.py)
script. More information can be found [here](README.md#model-formats).
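If it is unclear which format a given checkpoint is in, the state-dict keys are a quick tell. The sketch below is illustrative only; the key prefixes it tests for are assumptions rather than guarantees from the repo.

```python
import torch

# Illustrative sketch (assumptions: the checkpoint may wrap weights under a "model"
# key, and MultiMAE/MultiViT-format checkpoints contain "input_adapters." prefixes
# while plain timm ViT checkpoints use "blocks." / "patch_embed." instead).
def guess_checkpoint_format(path: str) -> str:
    ckpt = torch.load(path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt)  # unwrap "model" if present
    keys = list(state_dict.keys())
    if any(k.startswith("input_adapters.") for k in keys):
        return "multimae"   # already in MultiMAE / MultiViT format
    if any(k.startswith("blocks.") or k.startswith("patch_embed.") for k in keys):
        return "timm_vit"   # convert with tools/vit2multimae_converter.py first
    return "unknown"

# Example: print(guess_checkpoint_format("/path/to/weights.pth"))
```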
### Modifying configs
The training scripts support both YAML config files and command-line arguments. See [here](cfgs/finetune) for all fine-tuning config files.

To modify fine-tuning settings, either edit / add config files or provide additional command-line arguments.

:information_source: Config file arguments override default arguments, and command-line arguments override both default arguments and config arguments.

:warning: When changing settings (e.g., using a different pre-trained model), make sure to modify the `output_dir` and `wandb_run_name` (if logging is activated) to reflect the changes.

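The precedence above (defaults, then config file, then command line) can be mimicked with a small helper. This is only an illustrative sketch, not the repo's actual parser; the flag names `--lr` and `--output_dir` are hypothetical stand-ins.

```python
import argparse
import yaml  # pip install pyyaml

# Minimal sketch of the precedence: defaults < YAML config < command-line flags.
def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, default=None)
    parser.add_argument("--lr", type=float, default=1e-3)           # default value
    parser.add_argument("--output_dir", type=str, default="./out")  # default value
    args = parser.parse_args(argv)

    if args.config is not None:
        with open(args.config) as f:
            cfg = yaml.safe_load(f) or {}
        # Config values override defaults, but only for flags not given on the CLI.
        cli_given = {a.split("=")[0].lstrip("-").replace("-", "_")
                     for a in (argv or []) if a.startswith("--")}
        for key, value in cfg.items():
            if key not in cli_given and hasattr(args, key):
                setattr(args, key, value)
    return args

# Example: parse_args(["--config", "cfgs/finetune/cls/ft_in1k_100e_multimae-b.yaml",
#                      "--output_dir", "./my_run"])  # the CLI value wins for output_dir
```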
### Experiment logging
To activate logging to [Weights & Biases](https://docs.wandb.ai/), either edit the config files or use the `--log_wandb` flag along with any other extra logging arguments.


## Classification

We use 8 A100 GPUs for classification fine-tuning. Configs can be found [here](cfgs/finetune/cls).

To fine-tune MultiMAE on ImageNet-1K classification using default settings, run:
```bash
OMP_NUM_THREADS=1 torchrun --nproc_per_node=8 run_finetuning_cls.py \
--config cfgs/finetune/cls/ft_in1k_100e_multimae-b.yaml \
--finetune /path/to/multimae_weights \
--data_path /path/to/in1k/train/rgb \
--eval_data_path /path/to/in1k/val/rgb
```

- For a list of possible arguments, see [`run_finetuning_cls.py`](run_finetuning_cls.py).

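The `--data_path` and `--eval_data_path` arguments above are assumed here to point at ImageFolder-style directories (one sub-folder per class under `train/rgb`); a quick layout sanity check could look like the following sketch.

```python
from torchvision import datasets, transforms

# Quick sanity check of an ImageFolder-style layout such as
#   /path/to/in1k/train/rgb/<class_name>/<image>.jpg
# (hypothetical path; adjust to your setup).
def check_image_folder(root: str):
    dataset = datasets.ImageFolder(root, transform=transforms.ToTensor())
    print(f"{root}: {len(dataset)} images, {len(dataset.classes)} classes")
    return dataset

# Example: check_image_folder("/path/to/in1k/train/rgb")
```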
## Semantic segmentation

We use 4 A100 GPUs for semantic segmentation fine-tuning. Configs can be found [here](cfgs/finetune/semseg).

### ADE20K
To fine-tune MultiMAE on ADE20K semantic segmentation with default settings and **RGB** as the input modality, run:
```bash
OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 run_finetuning_semseg.py \
--config cfgs/finetune/semseg/ade/ft_ade_64e_multimae-b_rgb.yaml \
--finetune /path/to/multimae_weights \
--data_path /path/to/ade20k/train \
--eval_data_path /path/to/ade20k/val
```

- For a list of possible arguments, see [`run_finetuning_semseg.py`](run_finetuning_semseg.py).


### Hypersim
To fine-tune MultiMAE on Hypersim semantic segmentation with default settings and **RGB** as the input modality, run:
```bash
OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 run_finetuning_semseg.py \
--config cfgs/finetune/semseg/hypersim/ft_hypersim_25e_multimae-b_rgb.yaml \
--finetune /path/to/multimae_weights \
--data_path /path/to/hypersim/train \
--eval_data_path /path/to/hypersim/val
```

- To fine-tune using **depth-only** and **RGB + depth** as the input modalities, simply swap the config file to the appropriate one.
- For a list of possible arguments, see [`run_finetuning_semseg.py`](run_finetuning_semseg.py).


### NYUv2
To fine-tune MultiMAE on NYUv2 semantic segmentation with default settings and **RGB** as the input modality, run:
```bash
OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 run_finetuning_semseg.py \
--config cfgs/finetune/semseg/nyu/ft_nyu_200e_multimae-b_rgb.yaml \
--finetune /path/to/multimae_weights \
--data_path /path/to/nyu/train \
--eval_data_path /path/to/nyu/test_or_val
```

- To fine-tune using **depth-only** and **RGB + depth** as the input modalities, simply swap the config file to the appropriate one.
- For a list of possible arguments, see [`run_finetuning_semseg.py`](run_finetuning_semseg.py).

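Which modalities the model consumes is controlled by its input adapters. The sketch below mirrors the adapter configuration used in `app.py` later in this diff (3-channel RGB and 1-channel depth, both at stride level 1); it illustrates the idea and is not the exact code path of `run_finetuning_semseg.py`.

```python
from functools import partial

from multimae.input_adapters import PatchedInputAdapter

# Sketch: per-modality input adapters, mirroring DOMAIN_CONF in app.py below.
# An RGB-only run uses only the 'rgb' adapter; an RGB + depth run uses both.
INPUT_ADAPTERS = {
    'rgb': partial(PatchedInputAdapter, num_channels=3, stride_level=1),
    'depth': partial(PatchedInputAdapter, num_channels=1, stride_level=1),
}

def build_input_adapters(modalities, patch_size=16):
    return {m: INPUT_ADAPTERS[m](patch_size_full=patch_size) for m in modalities}

# Example: adapters = build_input_adapters(['rgb', 'depth'])
```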
## Depth estimation

We use 2 A100 GPUs for depth estimation fine-tuning. Configs can be found [here](cfgs/finetune/depth).


To fine-tune MultiMAE on NYUv2 depth estimation with default settings, run:
```bash
OMP_NUM_THREADS=1 torchrun --nproc_per_node=2 run_finetuning_depth.py \
--config cfgs/finetune/depth/ft_nyu_2000e_multimae-b.yaml \
--finetune /path/to/multimae_weights \
--data_path /path/to/nyu/train \
--eval_data_path /path/to/nyu/test_or_val
```
- For a list of possible arguments, see [`run_finetuning_depth.py`](run_finetuning_depth.py).

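For reference, the robust depth standardization used for the depth pseudo labels in `app.py` (later in this diff) truncates the lowest and highest 10% of values before computing mean and variance; a standalone version of that step:

```python
import torch

def robust_standardize_depth(depth: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize a depth map using statistics of its central 80% of values,
    as done for the depth pseudo labels in app.py (truncate 10% at both ends)."""
    trunc = torch.sort(depth.flatten())[0]
    trunc = trunc[int(0.1 * trunc.shape[0]): int(0.9 * trunc.shape[0])]
    return (depth - trunc.mean()) / torch.sqrt(trunc.var() + eps)

# Example: normalized = robust_standardize_depth(torch.rand(1, 224, 224) * 10.0)
```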
## Taskonomy tasks

We use 1 A100 GPU to fine-tune on Taskonomy tasks. Configs can be found [here](cfgs/finetune/taskonomy).

The tasks we support are: Principal curvature, z-buffer depth, texture edges, occlusion edges, 2D keypoints,
3D keypoints, surface normals, and reshading.


For example, to fine-tune MultiMAE on Taskonomy reshading with default settings, run:
```bash
OMP_NUM_THREADS=1 torchrun --nproc_per_node=1 run_finetuning_taskonomy.py \
--config cfgs/finetune/taskonomy/rgb2reshading-1k/ft_rgb2reshading_multimae-b.yaml \
--finetune /path/to/multimae_weights \
--data_path /path/to/taskonomy_tiny
```

- To fine-tune on a different task, simply swap the config file to the appropriate one.
- For a list of possible arguments, see [`run_finetuning_taskonomy.py`](run_finetuning_taskonomy.py).
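To run several Taskonomy tasks back to back, one option is a small launcher loop such as the sketch below; only the reshading config path is taken from this document, and the commented-out name for another task is an assumption to be checked against `cfgs/finetune/taskonomy`.

```python
import os
import subprocess

# Hypothetical list of task configs; verify the file names under cfgs/finetune/taskonomy.
TASK_CONFIGS = [
    "cfgs/finetune/taskonomy/rgb2reshading-1k/ft_rgb2reshading_multimae-b.yaml",
    # "cfgs/finetune/taskonomy/rgb2normal-1k/ft_rgb2normal_multimae-b.yaml",  # assumed name
]

env = {**os.environ, "OMP_NUM_THREADS": "1"}
for config in TASK_CONFIGS:
    subprocess.run([
        "torchrun", "--nproc_per_node=1", "run_finetuning_taskonomy.py",
        "--config", config,
        "--finetune", "/path/to/multimae_weights",
        "--data_path", "/path/to/taskonomy_tiny",
    ], check=True, env=env)
```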
LICENSE
ADDED
@@ -0,0 +1,399 @@
Attribution-NonCommercial 4.0 International

=======================================================================

Creative Commons Corporation ("Creative Commons") is not a law firm and
does not provide legal services or legal advice. Distribution of
Creative Commons public licenses does not create a lawyer-client or
other relationship. Creative Commons makes its licenses and related
information available on an "as-is" basis. Creative Commons gives no
warranties regarding its licenses, any material licensed under their
terms and conditions, or any related information. Creative Commons
disclaims all liability for damages resulting from their use to the
fullest extent possible.

Using Creative Commons Public Licenses

Creative Commons public licenses provide a standard set of terms and
conditions that creators and other rights holders may use to share
original works of authorship and other material subject to copyright
and certain other rights specified in the public license below. The
following considerations are for informational purposes only, are not
exhaustive, and do not form part of our licenses.

     Considerations for licensors: Our public licenses are
     intended for use by those authorized to give the public
     permission to use material in ways otherwise restricted by
     copyright and certain other rights. Our licenses are
     irrevocable. Licensors should read and understand the terms
     and conditions of the license they choose before applying it.
     Licensors should also secure all rights necessary before
     applying our licenses so that the public can reuse the
     material as expected. Licensors should clearly mark any
     material not subject to the license. This includes other CC-
     licensed material, or material used under an exception or
     limitation to copyright. More considerations for licensors:
     wiki.creativecommons.org/Considerations_for_licensors

     Considerations for the public: By using one of our public
     licenses, a licensor grants the public permission to use the
     licensed material under specified terms and conditions. If
     the licensor's permission is not necessary for any reason--for
     example, because of any applicable exception or limitation to
     copyright--then that use is not regulated by the license. Our
     licenses grant only permissions under copyright and certain
     other rights that a licensor has authority to grant. Use of
     the licensed material may still be restricted for other
     reasons, including because others have copyright or other
     rights in the material. A licensor may make special requests,
     such as asking that all changes be marked or described.
     Although not required by our licenses, you are encouraged to
     respect those requests where reasonable. More_considerations
     for the public:
     wiki.creativecommons.org/Considerations_for_licensees

=======================================================================

Creative Commons Attribution-NonCommercial 4.0 International Public
License

By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution-NonCommercial 4.0 International Public License ("Public
License"). To the extent this Public License may be interpreted as a
contract, You are granted the Licensed Rights in consideration of Your
acceptance of these terms and conditions, and the Licensor grants You
such rights in consideration of benefits the Licensor receives from
making the Licensed Material available under these terms and
conditions.

Section 1 -- Definitions.

  a. Adapted Material means material subject to Copyright and Similar
     Rights that is derived from or based upon the Licensed Material
     and in which the Licensed Material is translated, altered,
     arranged, transformed, or otherwise modified in a manner requiring
     permission under the Copyright and Similar Rights held by the
     Licensor. For purposes of this Public License, where the Licensed
     Material is a musical work, performance, or sound recording,
     Adapted Material is always produced where the Licensed Material is
     synched in timed relation with a moving image.

  b. Adapter's License means the license You apply to Your Copyright
     and Similar Rights in Your contributions to Adapted Material in
     accordance with the terms and conditions of this Public License.

  c. Copyright and Similar Rights means copyright and/or similar rights
     closely related to copyright including, without limitation,
     performance, broadcast, sound recording, and Sui Generis Database
     Rights, without regard to how the rights are labeled or
     categorized. For purposes of this Public License, the rights
     specified in Section 2(b)(1)-(2) are not Copyright and Similar
     Rights.
  d. Effective Technological Measures means those measures that, in the
     absence of proper authority, may not be circumvented under laws
     fulfilling obligations under Article 11 of the WIPO Copyright
     Treaty adopted on December 20, 1996, and/or similar international
     agreements.

  e. Exceptions and Limitations means fair use, fair dealing, and/or
     any other exception or limitation to Copyright and Similar Rights
     that applies to Your use of the Licensed Material.

  f. Licensed Material means the artistic or literary work, database,
     or other material to which the Licensor applied this Public
     License.

  g. Licensed Rights means the rights granted to You subject to the
     terms and conditions of this Public License, which are limited to
     all Copyright and Similar Rights that apply to Your use of the
     Licensed Material and that the Licensor has authority to license.

  h. Licensor means the individual(s) or entity(ies) granting rights
     under this Public License.

  i. NonCommercial means not primarily intended for or directed towards
     commercial advantage or monetary compensation. For purposes of
     this Public License, the exchange of the Licensed Material for
     other material subject to Copyright and Similar Rights by digital
     file-sharing or similar means is NonCommercial provided there is
     no payment of monetary compensation in connection with the
     exchange.

  j. Share means to provide material to the public by any means or
     process that requires permission under the Licensed Rights, such
     as reproduction, public display, public performance, distribution,
     dissemination, communication, or importation, and to make material
     available to the public including in ways that members of the
     public may access the material from a place and at a time
     individually chosen by them.

  k. Sui Generis Database Rights means rights other than copyright
     resulting from Directive 96/9/EC of the European Parliament and of
     the Council of 11 March 1996 on the legal protection of databases,
     as amended and/or succeeded, as well as other essentially
     equivalent rights anywhere in the world.

  l. You means the individual or entity exercising the Licensed Rights
     under this Public License. Your has a corresponding meaning.

Section 2 -- Scope.

  a. License grant.

       1. Subject to the terms and conditions of this Public License,
          the Licensor hereby grants You a worldwide, royalty-free,
          non-sublicensable, non-exclusive, irrevocable license to
          exercise the Licensed Rights in the Licensed Material to:

            a. reproduce and Share the Licensed Material, in whole or
               in part, for NonCommercial purposes only; and

            b. produce, reproduce, and Share Adapted Material for
               NonCommercial purposes only.

       2. Exceptions and Limitations. For the avoidance of doubt, where
          Exceptions and Limitations apply to Your use, this Public
          License does not apply, and You do not need to comply with
          its terms and conditions.

       3. Term. The term of this Public License is specified in Section
          6(a).

       4. Media and formats; technical modifications allowed. The
          Licensor authorizes You to exercise the Licensed Rights in
          all media and formats whether now known or hereafter created,
          and to make technical modifications necessary to do so. The
          Licensor waives and/or agrees not to assert any right or
          authority to forbid You from making technical modifications
          necessary to exercise the Licensed Rights, including
          technical modifications necessary to circumvent Effective
          Technological Measures. For purposes of this Public License,
          simply making modifications authorized by this Section 2(a)
          (4) never produces Adapted Material.

       5. Downstream recipients.

            a. Offer from the Licensor -- Licensed Material. Every
               recipient of the Licensed Material automatically
               receives an offer from the Licensor to exercise the
               Licensed Rights under the terms and conditions of this
               Public License.

            b. No downstream restrictions. You may not offer or impose
               any additional or different terms or conditions on, or
               apply any Effective Technological Measures to, the
               Licensed Material if doing so restricts exercise of the
               Licensed Rights by any recipient of the Licensed
               Material.

       6. No endorsement. Nothing in this Public License constitutes or
          may be construed as permission to assert or imply that You
          are, or that Your use of the Licensed Material is, connected
          with, or sponsored, endorsed, or granted official status by,
          the Licensor or others designated to receive attribution as
          provided in Section 3(a)(1)(A)(i).

  b. Other rights.

       1. Moral rights, such as the right of integrity, are not
          licensed under this Public License, nor are publicity,
          privacy, and/or other similar personality rights; however, to
          the extent possible, the Licensor waives and/or agrees not to
          assert any such rights held by the Licensor to the limited
          extent necessary to allow You to exercise the Licensed
          Rights, but not otherwise.

       2. Patent and trademark rights are not licensed under this
          Public License.

       3. To the extent possible, the Licensor waives any right to
          collect royalties from You for the exercise of the Licensed
          Rights, whether directly or through a collecting society
          under any voluntary or waivable statutory or compulsory
          licensing scheme. In all other cases the Licensor expressly
          reserves any right to collect such royalties, including when
          the Licensed Material is used other than for NonCommercial
          purposes.

Section 3 -- License Conditions.

Your exercise of the Licensed Rights is expressly made subject to the
following conditions.

  a. Attribution.

       1. If You Share the Licensed Material (including in modified
          form), You must:

            a. retain the following if it is supplied by the Licensor
               with the Licensed Material:

                 i. identification of the creator(s) of the Licensed
                    Material and any others designated to receive
                    attribution, in any reasonable manner requested by
                    the Licensor (including by pseudonym if
                    designated);

                ii. a copyright notice;

               iii. a notice that refers to this Public License;

                iv. a notice that refers to the disclaimer of
                    warranties;

                 v. a URI or hyperlink to the Licensed Material to the
                    extent reasonably practicable;

            b. indicate if You modified the Licensed Material and
               retain an indication of any previous modifications; and

            c. indicate the Licensed Material is licensed under this
               Public License, and include the text of, or the URI or
               hyperlink to, this Public License.

       2. You may satisfy the conditions in Section 3(a)(1) in any
          reasonable manner based on the medium, means, and context in
          which You Share the Licensed Material. For example, it may be
          reasonable to satisfy the conditions by providing a URI or
          hyperlink to a resource that includes the required
          information.

       3. If requested by the Licensor, You must remove any of the
          information required by Section 3(a)(1)(A) to the extent
          reasonably practicable.

       4. If You Share Adapted Material You produce, the Adapter's
          License You apply must not prevent recipients of the Adapted
          Material from complying with this Public License.

Section 4 -- Sui Generis Database Rights.

Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:

  a. for the avoidance of doubt, Section 2(a)(1) grants You the right
     to extract, reuse, reproduce, and Share all or a substantial
     portion of the contents of the database for NonCommercial purposes
     only;

  b. if You include all or a substantial portion of the database
     contents in a database in which You have Sui Generis Database
     Rights, then the database in which You have Sui Generis Database
     Rights (but not its individual contents) is Adapted Material; and

  c. You must comply with the conditions in Section 3(a) if You Share
     all or a substantial portion of the contents of the database.

For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.

Section 5 -- Disclaimer of Warranties and Limitation of Liability.

  a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
     EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
     AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
     ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
     IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
     WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
     PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
     ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
     KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
     ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.

  b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
     TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
     NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
     INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
     COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
     USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
     ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
     DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
     IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.

  c. The disclaimer of warranties and limitation of liability provided
     above shall be interpreted in a manner that, to the extent
     possible, most closely approximates an absolute disclaimer and
     waiver of all liability.

Section 6 -- Term and Termination.

  a. This Public License applies for the term of the Copyright and
     Similar Rights licensed here. However, if You fail to comply with
     this Public License, then Your rights under this Public License
     terminate automatically.

  b. Where Your right to use the Licensed Material has terminated under
     Section 6(a), it reinstates:

       1. automatically as of the date the violation is cured, provided
          it is cured within 30 days of Your discovery of the
          violation; or

       2. upon express reinstatement by the Licensor.

     For the avoidance of doubt, this Section 6(b) does not affect any
     right the Licensor may have to seek remedies for Your violations
     of this Public License.

  c. For the avoidance of doubt, the Licensor may also offer the
     Licensed Material under separate terms or conditions or stop
     distributing the Licensed Material at any time; however, doing so
     will not terminate this Public License.

  d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
     License.

Section 7 -- Other Terms and Conditions.

  a. The Licensor shall not be bound by any additional or different
     terms or conditions communicated by You unless expressly agreed.

  b. Any arrangements, understandings, or agreements regarding the
     Licensed Material not stated herein are separate from and
     independent of the terms and conditions of this Public License.

Section 8 -- Interpretation.

  a. For the avoidance of doubt, this Public License does not, and
     shall not be interpreted to, reduce, limit, restrict, or impose
     conditions on any use of the Licensed Material that could lawfully
     be made without permission under this Public License.

  b. To the extent possible, if any provision of this Public License is
     deemed unenforceable, it shall be automatically reformed to the
     minimum extent necessary to make it enforceable. If the provision
     cannot be reformed, it shall be severed from this Public License
     without affecting the enforceability of the remaining terms and
     conditions.

  c. No term or condition of this Public License will be waived and no
     failure to comply consented to unless expressly agreed to by the
     Licensor.

  d. Nothing in this Public License constitutes or may be interpreted
     as a limitation upon, or waiver of, any privileges and immunities
     that apply to the Licensor or You, including from the legal
     processes of any jurisdiction or authority.

=======================================================================

Creative Commons is not a party to its public
licenses. Notwithstanding, Creative Commons may elect to apply one of
its public licenses to material it publishes and in those instances
will be considered the “Licensor.” The text of the Creative Commons
public licenses is dedicated to the public domain under the CC0 Public
Domain Dedication. Except for the limited purpose of indicating that
material is shared under a Creative Commons public license or as
otherwise permitted by the Creative Commons policies published at
creativecommons.org/policies, Creative Commons does not authorize the
use of the trademark "Creative Commons" or any other trademark or logo
of Creative Commons without its prior written consent including,
without limitation, in connection with any unauthorized modifications
to any of its public licenses or any other arrangements,
understandings, or agreements concerning use of licensed material. For
the avoidance of doubt, this paragraph does not form part of the
public licenses.

Creative Commons may be contacted at creativecommons.org.
app.py
ADDED
@@ -0,0 +1,405 @@
```python
import sys, os
import torch
TORCH_VERSION = ".".join(torch.__version__.split(".")[:2])
CUDA_VERSION = torch.__version__.split("+")[-1]
print("torch: ", TORCH_VERSION, "; cuda: ", CUDA_VERSION)
# Install detectron2 that matches the above pytorch version
# See https://detectron2.readthedocs.io/tutorials/install.html for instructions
os.system(f'pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/{CUDA_VERSION}/torch{TORCH_VERSION}/index.html')
os.system("pip install git+https://github.com/cocodataset/panopticapi.git")

# Imports
import gradio as gr
import detectron2
from detectron2.utils.logger import setup_logger
import numpy as np
import cv2
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF
from torchvision import datasets, transforms
from einops import rearrange
from PIL import Image
import imutils
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid
from tqdm import tqdm
import random
from functools import partial

# import some common detectron2 utilities
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer, ColorMode
from detectron2.data import MetadataCatalog
from detectron2.projects.deeplab import add_deeplab_config
coco_metadata = MetadataCatalog.get("coco_2017_val_panoptic")

# Import Mask2Former
from mask2former import add_maskformer2_config

# DPT dependencies for depth pseudo labeling
from dpt.models import DPTDepthModel

from multimae.input_adapters import PatchedInputAdapter, SemSegInputAdapter
from multimae.output_adapters import SpatialOutputAdapter
from multimae.multimae import pretrain_multimae_base
from utils.data_constants import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD

torch.set_grad_enabled(False)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'device: {device}')


# Initialize COCO Mask2Former
cfg = get_cfg()
cfg.MODEL.DEVICE='cpu'
add_deeplab_config(cfg)
add_maskformer2_config(cfg)
cfg.merge_from_file("mask2former/configs/coco/panoptic-segmentation/swin/maskformer2_swin_small_bs16_50ep.yaml")
cfg.MODEL.WEIGHTS = 'https://dl.fbaipublicfiles.com/maskformer/mask2former/coco/panoptic/maskformer2_swin_small_bs16_50ep/model_final_a407fd.pkl'
cfg.MODEL.MASK_FORMER.TEST.SEMANTIC_ON = True
cfg.MODEL.MASK_FORMER.TEST.INSTANCE_ON = True
cfg.MODEL.MASK_FORMER.TEST.PANOPTIC_ON = True
semseg_model = DefaultPredictor(cfg)

def predict_semseg(img):
    return semseg_model(255*img.permute(1,2,0).numpy())['sem_seg'].argmax(0)

def plot_semseg(img, semseg, ax):
    v = Visualizer(img.permute(1,2,0), coco_metadata, scale=1.2, instance_mode=ColorMode.IMAGE_BW)
    semantic_result = v.draw_sem_seg(semseg.cpu()).get_image()
    ax.imshow(semantic_result)


# Initialize Omnidata depth model
os.system("wget https://drive.switch.ch/index.php/s/RFfTZwyKROKKx0l/download")
os.system("unzip -j download -d pretrained_models")
os.system("rm download")

omnidata_ckpt = torch.load('./pretrained_models/omnidata_rgb2depth_dpt_hybrid.pth', map_location='cpu')
depth_model = DPTDepthModel()
depth_model.load_state_dict(omnidata_ckpt)
depth_model = depth_model.to(device).eval()

def predict_depth(img):
    depth_model_input = (img.unsqueeze(0) - 0.5) / 0.5
    return depth_model(depth_model_input.to(device))


# MultiMAE model setup
DOMAIN_CONF = {
    'rgb': {
        'input_adapter': partial(PatchedInputAdapter, num_channels=3, stride_level=1),
        'output_adapter': partial(SpatialOutputAdapter, num_channels=3, stride_level=1),
    },
    'depth': {
        'input_adapter': partial(PatchedInputAdapter, num_channels=1, stride_level=1),
        'output_adapter': partial(SpatialOutputAdapter, num_channels=1, stride_level=1),
    },
    'semseg': {
        'input_adapter': partial(SemSegInputAdapter, num_classes=133,
                                 dim_class_emb=64, interpolate_class_emb=False, stride_level=4),
        'output_adapter': partial(SpatialOutputAdapter, num_channels=133, stride_level=4),
    },
}
DOMAINS = ['rgb', 'depth', 'semseg']

input_adapters = {
    domain: dinfo['input_adapter'](
        patch_size_full=16,
    )
    for domain, dinfo in DOMAIN_CONF.items()
}
output_adapters = {
    domain: dinfo['output_adapter'](
        patch_size_full=16,
        dim_tokens=256,
        use_task_queries=True,
        depth=2,
        context_tasks=DOMAINS,
        task=domain
    )
    for domain, dinfo in DOMAIN_CONF.items()
}

multimae = pretrain_multimae_base(
    input_adapters=input_adapters,
    output_adapters=output_adapters,
)

CKPT_URL = 'https://github.com/EPFL-VILAB/MultiMAE/releases/download/pretrained-weights/multimae-b_98_rgb+-depth-semseg_1600e_multivit-afff3f8c.pth'
ckpt = torch.hub.load_state_dict_from_url(CKPT_URL, map_location='cpu')
multimae.load_state_dict(ckpt['model'], strict=False)
multimae = multimae.to(device).eval()


# Plotting

def get_masked_image(img, mask, image_size=224, patch_size=16, mask_value=0.0):
    img_token = rearrange(
        img.detach().cpu(),
        'b c (nh ph) (nw pw) -> b (nh nw) (c ph pw)',
        ph=patch_size, pw=patch_size, nh=image_size//patch_size, nw=image_size//patch_size
    )
    img_token[mask.detach().cpu()!=0] = mask_value
    img = rearrange(
        img_token,
        'b (nh nw) (c ph pw) -> b c (nh ph) (nw pw)',
        ph=patch_size, pw=patch_size, nh=image_size//patch_size, nw=image_size//patch_size
    )
    return img


def denormalize(img, mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD):
    return TF.normalize(
        img.clone(),
        mean= [-m/s for m, s in zip(mean, std)],
        std= [1/s for s in std]
    )

def plot_semseg_gt(input_dict, ax=None, image_size=224):
    metadata = MetadataCatalog.get("coco_2017_val_panoptic")
    instance_mode = ColorMode.IMAGE
    img_viz = 255 * denormalize(input_dict['rgb'].detach().cpu())[0].permute(1,2,0)
    semseg = F.interpolate(
        input_dict['semseg'].unsqueeze(0).cpu().float(), size=image_size, mode='nearest'
    ).long()[0,0]
    visualizer = Visualizer(img_viz, metadata, instance_mode=instance_mode, scale=1)
    visualizer.draw_sem_seg(semseg)
    if ax is not None:
        ax.imshow(visualizer.get_output().get_image())
    else:
        return visualizer.get_output().get_image()


def plot_semseg_gt_masked(input_dict, mask, ax=None, mask_value=1.0, image_size=224):
    img = plot_semseg_gt(input_dict, image_size=image_size)
    img = torch.LongTensor(img).permute(2,0,1).unsqueeze(0)
    masked_img = get_masked_image(img.float()/255.0, mask, image_size=image_size, patch_size=16, mask_value=mask_value)
    masked_img = masked_img[0].permute(1,2,0)

    if ax is not None:
        ax.imshow(masked_img)
    else:
        return masked_img


def get_pred_with_input(gt, pred, mask, image_size=224, patch_size=16):
    gt_token = rearrange(
        gt.detach().cpu(),
        'b c (nh ph) (nw pw) -> b (nh nw) (c ph pw)',
        ph=patch_size, pw=patch_size, nh=image_size//patch_size, nw=image_size//patch_size
    )
    pred_token = rearrange(
        pred.detach().cpu(),
        'b c (nh ph) (nw pw) -> b (nh nw) (c ph pw)',
        ph=patch_size, pw=patch_size, nh=image_size//patch_size, nw=image_size//patch_size
    )
    pred_token[mask.detach().cpu()==0] = gt_token[mask.detach().cpu()==0]
    img = rearrange(
        pred_token,
        'b (nh nw) (c ph pw) -> b c (nh ph) (nw pw)',
        ph=patch_size, pw=patch_size, nh=image_size//patch_size, nw=image_size//patch_size
    )
    return img


def plot_semseg_pred_masked(rgb, semseg_preds, semseg_gt, mask, ax=None, image_size=224):
    metadata = MetadataCatalog.get("coco_2017_val_panoptic")
    instance_mode = ColorMode.IMAGE
    img_viz = 255 * denormalize(rgb.detach().cpu())[0].permute(1,2,0)

    semseg = get_pred_with_input(
        semseg_gt.unsqueeze(1),
        semseg_preds.argmax(1).unsqueeze(1),
        mask,
        image_size=image_size//4,
        patch_size=4
    )

    semseg = F.interpolate(semseg.float(), size=image_size, mode='nearest')[0,0].long()

    visualizer = Visualizer(img_viz, metadata, instance_mode=instance_mode, scale=1)
    visualizer.draw_sem_seg(semseg)
    if ax is not None:
        ax.imshow(visualizer.get_output().get_image())
    else:
        return visualizer.get_output().get_image()

def plot_predictions(input_dict, preds, masks, image_size=224):

    masked_rgb = get_masked_image(
        denormalize(input_dict['rgb']),
        masks['rgb'],
        image_size=image_size,
        mask_value=1.0
    )[0].permute(1,2,0).detach().cpu()
    masked_depth = get_masked_image(
        input_dict['depth'],
        masks['depth'],
        image_size=image_size,
        mask_value=np.nan
    )[0,0].detach().cpu()

    pred_rgb = denormalize(preds['rgb'])[0].permute(1,2,0).clamp(0,1)
    pred_depth = preds['depth'][0,0].detach().cpu()

    pred_rgb2 = get_pred_with_input(
        denormalize(input_dict['rgb']),
        denormalize(preds['rgb']).clamp(0,1),
        masks['rgb'],
        image_size=image_size
    )[0].permute(1,2,0).detach().cpu()
    pred_depth2 = get_pred_with_input(
        input_dict['depth'],
        preds['depth'],
        masks['depth'],
        image_size=image_size
    )[0,0].detach().cpu()

    fig = plt.figure(figsize=(10, 10))
    grid = ImageGrid(fig, 111, nrows_ncols=(3, 3), axes_pad=0)

    grid[0].imshow(masked_rgb)
    grid[1].imshow(pred_rgb2)
    grid[2].imshow(denormalize(input_dict['rgb'])[0].permute(1,2,0).detach().cpu())

    grid[3].imshow(masked_depth)
    grid[4].imshow(pred_depth2)
    grid[5].imshow(input_dict['depth'][0,0].detach().cpu())

    plot_semseg_gt_masked(input_dict, masks['semseg'], grid[6], mask_value=1.0, image_size=image_size)
    plot_semseg_pred_masked(input_dict['rgb'], preds['semseg'], input_dict['semseg'], masks['semseg'], grid[7], image_size=image_size)
    plot_semseg_gt(input_dict, grid[8], image_size=image_size)

    for ax in grid:
        ax.set_xticks([])
        ax.set_yticks([])

    fontsize = 16
    grid[0].set_title('Masked inputs', fontsize=fontsize)
    grid[1].set_title('MultiMAE predictions', fontsize=fontsize)
    grid[2].set_title('Original Reference', fontsize=fontsize)
    grid[0].set_ylabel('RGB', fontsize=fontsize)
    grid[3].set_ylabel('Depth', fontsize=fontsize)
    grid[6].set_ylabel('Semantic', fontsize=fontsize)

    plt.savefig('./output.png', dpi=300, bbox_inches='tight')
    plt.close()


def inference(img, num_rgb, num_depth, num_semseg, seed, perform_sampling, alphas, num_tokens):
    im = Image.open(img)

    # Center crop and resize RGB
    image_size = 224 # Train resolution
    img = TF.center_crop(TF.to_tensor(im), min(im.size))
    img = TF.resize(img, image_size)

    # Predict depth and semseg
    depth = predict_depth(img)
    semseg = predict_semseg(img)


    # Pre-process RGB, depth and semseg to the MultiMAE input format
    input_dict = {}

    # Normalize RGB
    input_dict['rgb'] = TF.normalize(img, mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD).unsqueeze(0)

    # Normalize depth robustly
    trunc_depth = torch.sort(depth.flatten())[0]
    trunc_depth = trunc_depth[int(0.1 * trunc_depth.shape[0]): int(0.9 * trunc_depth.shape[0])]
    depth = (depth - trunc_depth.mean()[None,None,None]) / torch.sqrt(trunc_depth.var()[None,None,None] + 1e-6)
    input_dict['depth'] = depth.unsqueeze(0)

    # Downsample semantic segmentation
    stride = 4
    semseg = TF.resize(semseg.unsqueeze(0), (semseg.shape[0] // stride, semseg.shape[1] // stride), interpolation=TF.InterpolationMode.NEAREST)
    input_dict['semseg'] = semseg

    # To GPU
    input_dict = {k: v.to(device) for k,v in input_dict.items()}


    torch.manual_seed(int(seed)) # change seed to resample new mask

    if perform_sampling:
        # Randomly sample masks

        alphas = min(10000.0, max(0.00001, float(alphas))) # Clamp alphas to reasonable range

        preds, masks = multimae.forward(
            input_dict,
            mask_inputs=True, # True if forward pass should sample random masks
            num_encoded_tokens=num_tokens,
            alphas=alphas
        )
    else:
        # Randomly sample masks using the specified number of tokens per modality
        task_masks = {domain: torch.ones(1,196).long().to(device) for domain in DOMAINS}
        selected_rgb_idxs = torch.randperm(196)[:num_rgb]
        selected_depth_idxs = torch.randperm(196)[:num_depth]
        selected_semseg_idxs = torch.randperm(196)[:num_semseg]
        task_masks['rgb'][:,selected_rgb_idxs] = 0
        task_masks['depth'][:,selected_depth_idxs] = 0
        task_masks['semseg'][:,selected_semseg_idxs] = 0

        preds, masks = multimae.forward(
            input_dict,
            mask_inputs=True,
            task_masks=task_masks
        )

    preds = {domain: pred.detach().cpu() for domain, pred in preds.items()}
    masks = {domain: mask.detach().cpu() for domain, mask in masks.items()}

    plot_predictions(input_dict, preds, masks)

    return 'output.png'


title = "MultiMAE"
description = "Gradio demo for MultiMAE: Multi-modal Multi-task Masked Autoencoders. \
Upload your own images or try one of the examples below to explore the multi-modal masked reconstruction of a pre-trained MultiMAE model. \
Uploaded images are pseudo labeled using a DPT trained on Omnidata depth, and a Mask2Former trained on COCO. \
Choose the number of visible tokens using the sliders below (or sample them randomly) and see how MultiMAE reconstructs the modalities!"

article = "<p style='text-align: center'><a href='https://arxiv.org/abs/2204.01678' \
target='_blank'>MultiMAE: Multi-modal Multi-task Masked Autoencoders</a> | \
<a href='https://github.com/EPFL-VILAB/MultiMAE' target='_blank'>Github Repo</a></p>"

css = '.output-image{height: 713px !important}'

# Example images
os.system("wget https://i.imgur.com/c9ObJdK.jpg")
examples = [['c9ObJdK.jpg', 32, 32, 32, 0, True, 1.0, 98]]

gr.Interface(
    fn=inference,
    inputs=[
        gr.inputs.Image(label='RGB input image', type='filepath'),
        gr.inputs.Slider(label='Number of RGB input tokens', default=32, step=1, minimum=0, maximum=196),
        gr.inputs.Slider(label='Number of depth input tokens', default=32, step=1, minimum=0, maximum=196),
        gr.inputs.Slider(label='Number of semantic input tokens', default=32, step=1, minimum=0, maximum=196),
        gr.inputs.Number(label='Random seed: Change this to sample different masks', default=0),
        gr.inputs.Checkbox(label='Randomize the number of tokens: Check this to ignore the above sliders and randomly sample the number \
                                  of tokens per modality using the parameters below', default=False),
        gr.inputs.Slider(label='Symmetric Dirichlet concentration parameter (α > 0). Low values (α << 1.0) result in a sampling behavior, \
                                where most of the time, all visible tokens will be sampled from a single modality. High values \
                                (α >> 1.0) result in similar numbers of tokens being sampled for each modality. α = 1.0 is equivalent \
                                to uniform sampling over the simplex and contains both previous cases and everything in between.',
                         default=1.0, step=0.1, minimum=0.1, maximum=5.0),
        gr.inputs.Slider(label='Number of input tokens', default=98, step=1, minimum=0, maximum=588),
    ],
    outputs=[
        gr.outputs.Image(label='MultiMAE predictions', type='file')
    ],
    css=css,
    title=title,
    description=description,
    article=article,
    examples=examples
).launch(enable_queue=True, cache_examples=True)
```
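The Gradio interface above is a thin wrapper around `inference()`. For local debugging it can help to call the same function without the UI; the sketch below assumes it runs in the same Python session as `app.py` (for example pasted just above the `gr.Interface(...)` call) and uses a placeholder image path. The final loop only illustrates the Dirichlet behaviour described by the α slider.

```python
import torch

# Headless sanity check of the demo, reusing the functions defined in app.py above.
# 'example.jpg' is a placeholder path; replace it with a real image.
out_path = inference(
    img='example.jpg',
    num_rgb=32, num_depth=32, num_semseg=32,
    seed=0,
    perform_sampling=True,  # use Dirichlet sampling instead of the fixed per-modality sliders
    alphas=1.0,             # symmetric Dirichlet concentration (the α slider)
    num_tokens=98,          # total number of visible tokens
)
print('Saved visualization to', out_path)

# Intuition for the α slider: per-modality token shares are proportional to a draw
# from a symmetric Dirichlet. Small α concentrates tokens on one modality,
# large α spreads them evenly across the three modalities.
for alpha in (0.2, 1.0, 5.0):
    props = torch.distributions.Dirichlet(torch.full((3,), alpha)).sample()
    print(alpha, (props * 98).round().tolist())
```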
dpt/__init__.py
ADDED
(empty file)
dpt/base_model.py
ADDED
@@ -0,0 +1,16 @@
```python
import torch


class BaseModel(torch.nn.Module):
    def load(self, path):
        """Load model from file.

        Args:
            path (str): file path
        """
        parameters = torch.load(path, map_location=torch.device("cpu"))

        if "optimizer" in parameters:
            parameters = parameters["model"]

        self.load_state_dict(parameters)
```
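`BaseModel.load` accepts either a bare state dict or a training checkpoint that bundles optimizer state, in which case it unwraps the `model` entry. A usage sketch (the checkpoint path reuses the Omnidata weights downloaded in `app.py`; `DPTDepthModel` inherits `load()` through `BaseModel` via `dpt/models.py`):

```python
from dpt.models import DPTDepthModel

# Usage sketch for BaseModel.load (the path below is the checkpoint app.py downloads;
# adjust it to wherever your weights live). load() works both for bare state dicts
# and for training checkpoints of the form {"model": ..., "optimizer": ...}.
model = DPTDepthModel()
model.load("./pretrained_models/omnidata_rgb2depth_dpt_hybrid.pth")
model.eval()
```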
dpt/blocks.py
ADDED
@@ -0,0 +1,383 @@
```python
import torch
import torch.nn as nn

from .vit import (
    _make_pretrained_vitb_rn50_384,
    _make_pretrained_vitl16_384,
    _make_pretrained_vitb16_384,
    forward_vit,
)


def _make_encoder(
    backbone,
    features,
    use_pretrained,
    groups=1,
    expand=False,
    exportable=True,
    hooks=None,
    use_vit_only=False,
    use_readout="ignore",
    enable_attention_hooks=False,
):
    if backbone == "vitl16_384":
        pretrained = _make_pretrained_vitl16_384(
            use_pretrained,
            hooks=hooks,
            use_readout=use_readout,
            enable_attention_hooks=enable_attention_hooks,
        )
        scratch = _make_scratch(
            [256, 512, 1024, 1024], features, groups=groups, expand=expand
        )  # ViT-L/16 - 85.0% Top1 (backbone)
    elif backbone == "vitb_rn50_384":
        pretrained = _make_pretrained_vitb_rn50_384(
            use_pretrained,
            hooks=hooks,
            use_vit_only=use_vit_only,
            use_readout=use_readout,
            enable_attention_hooks=enable_attention_hooks,
        )
        scratch = _make_scratch(
            [256, 512, 768, 768], features, groups=groups, expand=expand
        )  # ViT-H/16 - 85.0% Top1 (backbone)
    elif backbone == "vitb16_384":
        pretrained = _make_pretrained_vitb16_384(
            use_pretrained,
            hooks=hooks,
            use_readout=use_readout,
            enable_attention_hooks=enable_attention_hooks,
        )
        scratch = _make_scratch(
            [96, 192, 384, 768], features, groups=groups, expand=expand
        )  # ViT-B/16 - 84.6% Top1 (backbone)
    elif backbone == "resnext101_wsl":
        pretrained = _make_pretrained_resnext101_wsl(use_pretrained)
        scratch = _make_scratch(
            [256, 512, 1024, 2048], features, groups=groups, expand=expand
        )  # efficientnet_lite3
    else:
        print(f"Backbone '{backbone}' not implemented")
        assert False

    return pretrained, scratch


def _make_scratch(in_shape, out_shape, groups=1, expand=False):
    scratch = nn.Module()

    out_shape1 = out_shape
    out_shape2 = out_shape
    out_shape3 = out_shape
    out_shape4 = out_shape
    if expand == True:
        out_shape1 = out_shape
        out_shape2 = out_shape * 2
        out_shape3 = out_shape * 4
        out_shape4 = out_shape * 8

    scratch.layer1_rn = nn.Conv2d(
        in_shape[0],
        out_shape1,
        kernel_size=3,
        stride=1,
        padding=1,
        bias=False,
        groups=groups,
    )
    scratch.layer2_rn = nn.Conv2d(
        in_shape[1],
        out_shape2,
        kernel_size=3,
        stride=1,
        padding=1,
        bias=False,
        groups=groups,
    )
    scratch.layer3_rn = nn.Conv2d(
        in_shape[2],
        out_shape3,
        kernel_size=3,
        stride=1,
        padding=1,
        bias=False,
        groups=groups,
    )
    scratch.layer4_rn = nn.Conv2d(
        in_shape[3],
        out_shape4,
        kernel_size=3,
        stride=1,
        padding=1,
        bias=False,
        groups=groups,
    )

    return scratch


def _make_resnet_backbone(resnet):
    pretrained = nn.Module()
    pretrained.layer1 = nn.Sequential(
        resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool, resnet.layer1
    )

    pretrained.layer2 = resnet.layer2
    pretrained.layer3 = resnet.layer3
    pretrained.layer4 = resnet.layer4

    return pretrained


def _make_pretrained_resnext101_wsl(use_pretrained):
    resnet = torch.hub.load("facebookresearch/WSL-Images", "resnext101_32x8d_wsl")
    return _make_resnet_backbone(resnet)
```
136 |
+
|
137 |
+
|
138 |
+
class Interpolate(nn.Module):
|
139 |
+
"""Interpolation module."""
|
140 |
+
|
141 |
+
def __init__(self, scale_factor, mode, align_corners=False):
|
142 |
+
"""Init.
|
143 |
+
|
144 |
+
Args:
|
145 |
+
scale_factor (float): scaling
|
146 |
+
mode (str): interpolation mode
|
147 |
+
"""
|
148 |
+
super(Interpolate, self).__init__()
|
149 |
+
|
150 |
+
self.interp = nn.functional.interpolate
|
151 |
+
self.scale_factor = scale_factor
|
152 |
+
self.mode = mode
|
153 |
+
self.align_corners = align_corners
|
154 |
+
|
155 |
+
def forward(self, x):
|
156 |
+
"""Forward pass.
|
157 |
+
|
158 |
+
Args:
|
159 |
+
x (tensor): input
|
160 |
+
|
161 |
+
Returns:
|
162 |
+
tensor: interpolated data
|
163 |
+
"""
|
164 |
+
|
165 |
+
x = self.interp(
|
166 |
+
x,
|
167 |
+
scale_factor=self.scale_factor,
|
168 |
+
mode=self.mode,
|
169 |
+
align_corners=self.align_corners,
|
170 |
+
)
|
171 |
+
|
172 |
+
return x
|
173 |
+
|
174 |
+
|
175 |
+
class ResidualConvUnit(nn.Module):
|
176 |
+
"""Residual convolution module."""
|
177 |
+
|
178 |
+
def __init__(self, features):
|
179 |
+
"""Init.
|
180 |
+
|
181 |
+
Args:
|
182 |
+
features (int): number of features
|
183 |
+
"""
|
184 |
+
super().__init__()
|
185 |
+
|
186 |
+
self.conv1 = nn.Conv2d(
|
187 |
+
features, features, kernel_size=3, stride=1, padding=1, bias=True
|
188 |
+
)
|
189 |
+
|
190 |
+
self.conv2 = nn.Conv2d(
|
191 |
+
features, features, kernel_size=3, stride=1, padding=1, bias=True
|
192 |
+
)
|
193 |
+
|
194 |
+
self.relu = nn.ReLU(inplace=True)
|
195 |
+
|
196 |
+
def forward(self, x):
|
197 |
+
"""Forward pass.
|
198 |
+
|
199 |
+
Args:
|
200 |
+
x (tensor): input
|
201 |
+
|
202 |
+
Returns:
|
203 |
+
tensor: output
|
204 |
+
"""
|
205 |
+
out = self.relu(x)
|
206 |
+
out = self.conv1(out)
|
207 |
+
out = self.relu(out)
|
208 |
+
out = self.conv2(out)
|
209 |
+
|
210 |
+
return out + x
|
211 |
+
|
212 |
+
|
213 |
+
class FeatureFusionBlock(nn.Module):
|
214 |
+
"""Feature fusion block."""
|
215 |
+
|
216 |
+
def __init__(self, features):
|
217 |
+
"""Init.
|
218 |
+
|
219 |
+
Args:
|
220 |
+
features (int): number of features
|
221 |
+
"""
|
222 |
+
super(FeatureFusionBlock, self).__init__()
|
223 |
+
|
224 |
+
self.resConfUnit1 = ResidualConvUnit(features)
|
225 |
+
self.resConfUnit2 = ResidualConvUnit(features)
|
226 |
+
|
227 |
+
def forward(self, *xs):
|
228 |
+
"""Forward pass.
|
229 |
+
|
230 |
+
Returns:
|
231 |
+
tensor: output
|
232 |
+
"""
|
233 |
+
output = xs[0]
|
234 |
+
|
235 |
+
if len(xs) == 2:
|
236 |
+
output += self.resConfUnit1(xs[1])
|
237 |
+
|
238 |
+
output = self.resConfUnit2(output)
|
239 |
+
|
240 |
+
output = nn.functional.interpolate(
|
241 |
+
output, scale_factor=2, mode="bilinear", align_corners=True
|
242 |
+
)
|
243 |
+
|
244 |
+
return output
|
245 |
+
|
246 |
+
|
247 |
+
class ResidualConvUnit_custom(nn.Module):
|
248 |
+
"""Residual convolution module."""
|
249 |
+
|
250 |
+
def __init__(self, features, activation, bn):
|
251 |
+
"""Init.
|
252 |
+
|
253 |
+
Args:
|
254 |
+
features (int): number of features
|
255 |
+
"""
|
256 |
+
super().__init__()
|
257 |
+
|
258 |
+
self.bn = bn
|
259 |
+
|
260 |
+
self.groups = 1
|
261 |
+
|
262 |
+
self.conv1 = nn.Conv2d(
|
263 |
+
features,
|
264 |
+
features,
|
265 |
+
kernel_size=3,
|
266 |
+
stride=1,
|
267 |
+
padding=1,
|
268 |
+
bias=not self.bn,
|
269 |
+
groups=self.groups,
|
270 |
+
)
|
271 |
+
|
272 |
+
self.conv2 = nn.Conv2d(
|
273 |
+
features,
|
274 |
+
features,
|
275 |
+
kernel_size=3,
|
276 |
+
stride=1,
|
277 |
+
padding=1,
|
278 |
+
bias=not self.bn,
|
279 |
+
groups=self.groups,
|
280 |
+
)
|
281 |
+
|
282 |
+
if self.bn == True:
|
283 |
+
self.bn1 = nn.BatchNorm2d(features)
|
284 |
+
self.bn2 = nn.BatchNorm2d(features)
|
285 |
+
|
286 |
+
self.activation = activation
|
287 |
+
|
288 |
+
self.skip_add = nn.quantized.FloatFunctional()
|
289 |
+
|
290 |
+
def forward(self, x):
|
291 |
+
"""Forward pass.
|
292 |
+
|
293 |
+
Args:
|
294 |
+
x (tensor): input
|
295 |
+
|
296 |
+
Returns:
|
297 |
+
tensor: output
|
298 |
+
"""
|
299 |
+
|
300 |
+
out = self.activation(x)
|
301 |
+
out = self.conv1(out)
|
302 |
+
if self.bn == True:
|
303 |
+
out = self.bn1(out)
|
304 |
+
|
305 |
+
out = self.activation(out)
|
306 |
+
out = self.conv2(out)
|
307 |
+
if self.bn == True:
|
308 |
+
out = self.bn2(out)
|
309 |
+
|
310 |
+
if self.groups > 1:
|
311 |
+
out = self.conv_merge(out)
|
312 |
+
|
313 |
+
return self.skip_add.add(out, x)
|
314 |
+
|
315 |
+
# return out + x
|
316 |
+
|
317 |
+
|
318 |
+
class FeatureFusionBlock_custom(nn.Module):
|
319 |
+
"""Feature fusion block."""
|
320 |
+
|
321 |
+
def __init__(
|
322 |
+
self,
|
323 |
+
features,
|
324 |
+
activation,
|
325 |
+
deconv=False,
|
326 |
+
bn=False,
|
327 |
+
expand=False,
|
328 |
+
align_corners=True,
|
329 |
+
):
|
330 |
+
"""Init.
|
331 |
+
|
332 |
+
Args:
|
333 |
+
features (int): number of features
|
334 |
+
"""
|
335 |
+
super(FeatureFusionBlock_custom, self).__init__()
|
336 |
+
|
337 |
+
self.deconv = deconv
|
338 |
+
self.align_corners = align_corners
|
339 |
+
|
340 |
+
self.groups = 1
|
341 |
+
|
342 |
+
self.expand = expand
|
343 |
+
out_features = features
|
344 |
+
if self.expand == True:
|
345 |
+
out_features = features // 2
|
346 |
+
|
347 |
+
self.out_conv = nn.Conv2d(
|
348 |
+
features,
|
349 |
+
out_features,
|
350 |
+
kernel_size=1,
|
351 |
+
stride=1,
|
352 |
+
padding=0,
|
353 |
+
bias=True,
|
354 |
+
groups=1,
|
355 |
+
)
|
356 |
+
|
357 |
+
self.resConfUnit1 = ResidualConvUnit_custom(features, activation, bn)
|
358 |
+
self.resConfUnit2 = ResidualConvUnit_custom(features, activation, bn)
|
359 |
+
|
360 |
+
self.skip_add = nn.quantized.FloatFunctional()
|
361 |
+
|
362 |
+
def forward(self, *xs):
|
363 |
+
"""Forward pass.
|
364 |
+
|
365 |
+
Returns:
|
366 |
+
tensor: output
|
367 |
+
"""
|
368 |
+
output = xs[0]
|
369 |
+
|
370 |
+
if len(xs) == 2:
|
371 |
+
res = self.resConfUnit1(xs[1])
|
372 |
+
output = self.skip_add.add(output, res)
|
373 |
+
# output += res
|
374 |
+
|
375 |
+
output = self.resConfUnit2(output)
|
376 |
+
|
377 |
+
output = nn.functional.interpolate(
|
378 |
+
output, scale_factor=2, mode="bilinear", align_corners=self.align_corners
|
379 |
+
)
|
380 |
+
|
381 |
+
output = self.out_conv(output)
|
382 |
+
|
383 |
+
return output
|
dpt/midas_net.py
ADDED
@@ -0,0 +1,77 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
"""MidashNet: Network for monocular depth estimation trained by mixing several datasets.
|
2 |
+
This file contains code that is adapted from
|
3 |
+
https://github.com/thomasjpfan/pytorch_refinenet/blob/master/pytorch_refinenet/refinenet/refinenet_4cascade.py
|
4 |
+
"""
|
5 |
+
import torch
|
6 |
+
import torch.nn as nn
|
7 |
+
|
8 |
+
from .base_model import BaseModel
|
9 |
+
from .blocks import FeatureFusionBlock, Interpolate, _make_encoder
|
10 |
+
|
11 |
+
|
12 |
+
class MidasNet_large(BaseModel):
|
13 |
+
"""Network for monocular depth estimation."""
|
14 |
+
|
15 |
+
def __init__(self, path=None, features=256, non_negative=True):
|
16 |
+
"""Init.
|
17 |
+
|
18 |
+
Args:
|
19 |
+
path (str, optional): Path to saved model. Defaults to None.
|
20 |
+
features (int, optional): Number of features. Defaults to 256.
|
21 |
+
backbone (str, optional): Backbone network for encoder. Defaults to resnet50
|
22 |
+
"""
|
23 |
+
print("Loading weights: ", path)
|
24 |
+
|
25 |
+
super(MidasNet_large, self).__init__()
|
26 |
+
|
27 |
+
use_pretrained = False if path is None else True
|
28 |
+
|
29 |
+
self.pretrained, self.scratch = _make_encoder(
|
30 |
+
backbone="resnext101_wsl", features=features, use_pretrained=use_pretrained
|
31 |
+
)
|
32 |
+
|
33 |
+
self.scratch.refinenet4 = FeatureFusionBlock(features)
|
34 |
+
self.scratch.refinenet3 = FeatureFusionBlock(features)
|
35 |
+
self.scratch.refinenet2 = FeatureFusionBlock(features)
|
36 |
+
self.scratch.refinenet1 = FeatureFusionBlock(features)
|
37 |
+
|
38 |
+
self.scratch.output_conv = nn.Sequential(
|
39 |
+
nn.Conv2d(features, 128, kernel_size=3, stride=1, padding=1),
|
40 |
+
Interpolate(scale_factor=2, mode="bilinear"),
|
41 |
+
nn.Conv2d(128, 32, kernel_size=3, stride=1, padding=1),
|
42 |
+
nn.ReLU(True),
|
43 |
+
nn.Conv2d(32, 1, kernel_size=1, stride=1, padding=0),
|
44 |
+
nn.ReLU(True) if non_negative else nn.Identity(),
|
45 |
+
)
|
46 |
+
|
47 |
+
if path:
|
48 |
+
self.load(path)
|
49 |
+
|
50 |
+
def forward(self, x):
|
51 |
+
"""Forward pass.
|
52 |
+
|
53 |
+
Args:
|
54 |
+
x (tensor): input data (image)
|
55 |
+
|
56 |
+
Returns:
|
57 |
+
tensor: depth
|
58 |
+
"""
|
59 |
+
|
60 |
+
layer_1 = self.pretrained.layer1(x)
|
61 |
+
layer_2 = self.pretrained.layer2(layer_1)
|
62 |
+
layer_3 = self.pretrained.layer3(layer_2)
|
63 |
+
layer_4 = self.pretrained.layer4(layer_3)
|
64 |
+
|
65 |
+
layer_1_rn = self.scratch.layer1_rn(layer_1)
|
66 |
+
layer_2_rn = self.scratch.layer2_rn(layer_2)
|
67 |
+
layer_3_rn = self.scratch.layer3_rn(layer_3)
|
68 |
+
layer_4_rn = self.scratch.layer4_rn(layer_4)
|
69 |
+
|
70 |
+
path_4 = self.scratch.refinenet4(layer_4_rn)
|
71 |
+
path_3 = self.scratch.refinenet3(path_4, layer_3_rn)
|
72 |
+
path_2 = self.scratch.refinenet2(path_3, layer_2_rn)
|
73 |
+
path_1 = self.scratch.refinenet1(path_2, layer_1_rn)
|
74 |
+
|
75 |
+
out = self.scratch.output_conv(path_1)
|
76 |
+
|
77 |
+
return torch.squeeze(out, dim=1)
|
dpt/models.py
ADDED
@@ -0,0 +1,153 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import torch
|
2 |
+
import torch.nn as nn
|
3 |
+
import torch.nn.functional as F
|
4 |
+
|
5 |
+
from .base_model import BaseModel
|
6 |
+
from .blocks import (
|
7 |
+
FeatureFusionBlock,
|
8 |
+
FeatureFusionBlock_custom,
|
9 |
+
Interpolate,
|
10 |
+
_make_encoder,
|
11 |
+
forward_vit,
|
12 |
+
)
|
13 |
+
|
14 |
+
|
15 |
+
def _make_fusion_block(features, use_bn):
|
16 |
+
return FeatureFusionBlock_custom(
|
17 |
+
features,
|
18 |
+
nn.ReLU(False),
|
19 |
+
deconv=False,
|
20 |
+
bn=use_bn,
|
21 |
+
expand=False,
|
22 |
+
align_corners=True,
|
23 |
+
)
|
24 |
+
|
25 |
+
|
26 |
+
class DPT(BaseModel):
|
27 |
+
def __init__(
|
28 |
+
self,
|
29 |
+
head,
|
30 |
+
features=256,
|
31 |
+
backbone="vitb_rn50_384",
|
32 |
+
readout="project",
|
33 |
+
channels_last=False,
|
34 |
+
use_bn=False,
|
35 |
+
enable_attention_hooks=False,
|
36 |
+
):
|
37 |
+
|
38 |
+
super(DPT, self).__init__()
|
39 |
+
|
40 |
+
self.channels_last = channels_last
|
41 |
+
|
42 |
+
hooks = {
|
43 |
+
"vitb_rn50_384": [0, 1, 8, 11],
|
44 |
+
"vitb16_384": [2, 5, 8, 11],
|
45 |
+
"vitl16_384": [5, 11, 17, 23],
|
46 |
+
}
|
47 |
+
|
48 |
+
# Instantiate backbone and reassemble blocks
|
49 |
+
self.pretrained, self.scratch = _make_encoder(
|
50 |
+
backbone,
|
51 |
+
features,
|
52 |
+
False, # Set to true of you want to train from scratch, uses ImageNet weights
|
53 |
+
groups=1,
|
54 |
+
expand=False,
|
55 |
+
exportable=False,
|
56 |
+
hooks=hooks[backbone],
|
57 |
+
use_readout=readout,
|
58 |
+
enable_attention_hooks=enable_attention_hooks,
|
59 |
+
)
|
60 |
+
|
61 |
+
self.scratch.refinenet1 = _make_fusion_block(features, use_bn)
|
62 |
+
self.scratch.refinenet2 = _make_fusion_block(features, use_bn)
|
63 |
+
self.scratch.refinenet3 = _make_fusion_block(features, use_bn)
|
64 |
+
self.scratch.refinenet4 = _make_fusion_block(features, use_bn)
|
65 |
+
|
66 |
+
self.scratch.output_conv = head
|
67 |
+
|
68 |
+
def forward(self, x):
|
69 |
+
if self.channels_last == True:
|
70 |
+
x.contiguous(memory_format=torch.channels_last)
|
71 |
+
|
72 |
+
layer_1, layer_2, layer_3, layer_4 = forward_vit(self.pretrained, x)
|
73 |
+
|
74 |
+
layer_1_rn = self.scratch.layer1_rn(layer_1)
|
75 |
+
layer_2_rn = self.scratch.layer2_rn(layer_2)
|
76 |
+
layer_3_rn = self.scratch.layer3_rn(layer_3)
|
77 |
+
layer_4_rn = self.scratch.layer4_rn(layer_4)
|
78 |
+
|
79 |
+
path_4 = self.scratch.refinenet4(layer_4_rn)
|
80 |
+
path_3 = self.scratch.refinenet3(path_4, layer_3_rn)
|
81 |
+
path_2 = self.scratch.refinenet2(path_3, layer_2_rn)
|
82 |
+
path_1 = self.scratch.refinenet1(path_2, layer_1_rn)
|
83 |
+
|
84 |
+
out = self.scratch.output_conv(path_1)
|
85 |
+
|
86 |
+
return out
|
87 |
+
|
88 |
+
|
89 |
+
class DPTDepthModel(DPT):
|
90 |
+
def __init__(
|
91 |
+
self, path=None, non_negative=True, scale=1.0, shift=0.0, invert=False, **kwargs
|
92 |
+
):
|
93 |
+
features = kwargs["features"] if "features" in kwargs else 256
|
94 |
+
|
95 |
+
self.scale = scale
|
96 |
+
self.shift = shift
|
97 |
+
self.invert = invert
|
98 |
+
|
99 |
+
head = nn.Sequential(
|
100 |
+
nn.Conv2d(features, features // 2, kernel_size=3, stride=1, padding=1),
|
101 |
+
Interpolate(scale_factor=2, mode="bilinear", align_corners=True),
|
102 |
+
nn.Conv2d(features // 2, 32, kernel_size=3, stride=1, padding=1),
|
103 |
+
nn.ReLU(True),
|
104 |
+
nn.Conv2d(32, 1, kernel_size=1, stride=1, padding=0),
|
105 |
+
nn.ReLU(True) if non_negative else nn.Identity(),
|
106 |
+
nn.Identity(),
|
107 |
+
)
|
108 |
+
|
109 |
+
super().__init__(head, **kwargs)
|
110 |
+
|
111 |
+
if path is not None:
|
112 |
+
self.load(path)
|
113 |
+
|
114 |
+
def forward(self, x):
|
115 |
+
inv_depth = super().forward(x).squeeze(dim=1)
|
116 |
+
|
117 |
+
if self.invert:
|
118 |
+
depth = self.scale * inv_depth + self.shift
|
119 |
+
depth[depth < 1e-8] = 1e-8
|
120 |
+
depth = 1.0 / depth
|
121 |
+
return depth
|
122 |
+
else:
|
123 |
+
return inv_depth
|
124 |
+
|
125 |
+
|
126 |
+
class DPTSegmentationModel(DPT):
|
127 |
+
def __init__(self, num_classes, path=None, **kwargs):
|
128 |
+
|
129 |
+
features = kwargs["features"] if "features" in kwargs else 256
|
130 |
+
|
131 |
+
kwargs["use_bn"] = True
|
132 |
+
|
133 |
+
head = nn.Sequential(
|
134 |
+
nn.Conv2d(features, features, kernel_size=3, padding=1, bias=False),
|
135 |
+
nn.BatchNorm2d(features),
|
136 |
+
nn.ReLU(True),
|
137 |
+
nn.Dropout(0.1, False),
|
138 |
+
nn.Conv2d(features, num_classes, kernel_size=1),
|
139 |
+
Interpolate(scale_factor=2, mode="bilinear", align_corners=True),
|
140 |
+
)
|
141 |
+
|
142 |
+
super().__init__(head, **kwargs)
|
143 |
+
|
144 |
+
self.auxlayer = nn.Sequential(
|
145 |
+
nn.Conv2d(features, features, kernel_size=3, padding=1, bias=False),
|
146 |
+
nn.BatchNorm2d(features),
|
147 |
+
nn.ReLU(True),
|
148 |
+
nn.Dropout(0.1, False),
|
149 |
+
nn.Conv2d(features, num_classes, kernel_size=1),
|
150 |
+
)
|
151 |
+
|
152 |
+
if path is not None:
|
153 |
+
self.load(path)
|
dpt/transforms.py
ADDED
@@ -0,0 +1,231 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import numpy as np
|
2 |
+
import cv2
|
3 |
+
import math
|
4 |
+
|
5 |
+
|
6 |
+
def apply_min_size(sample, size, image_interpolation_method=cv2.INTER_AREA):
|
7 |
+
"""Rezise the sample to ensure the given size. Keeps aspect ratio.
|
8 |
+
|
9 |
+
Args:
|
10 |
+
sample (dict): sample
|
11 |
+
size (tuple): image size
|
12 |
+
|
13 |
+
Returns:
|
14 |
+
tuple: new size
|
15 |
+
"""
|
16 |
+
shape = list(sample["disparity"].shape)
|
17 |
+
|
18 |
+
if shape[0] >= size[0] and shape[1] >= size[1]:
|
19 |
+
return sample
|
20 |
+
|
21 |
+
scale = [0, 0]
|
22 |
+
scale[0] = size[0] / shape[0]
|
23 |
+
scale[1] = size[1] / shape[1]
|
24 |
+
|
25 |
+
scale = max(scale)
|
26 |
+
|
27 |
+
shape[0] = math.ceil(scale * shape[0])
|
28 |
+
shape[1] = math.ceil(scale * shape[1])
|
29 |
+
|
30 |
+
# resize
|
31 |
+
sample["image"] = cv2.resize(
|
32 |
+
sample["image"], tuple(shape[::-1]), interpolation=image_interpolation_method
|
33 |
+
)
|
34 |
+
|
35 |
+
sample["disparity"] = cv2.resize(
|
36 |
+
sample["disparity"], tuple(shape[::-1]), interpolation=cv2.INTER_NEAREST
|
37 |
+
)
|
38 |
+
sample["mask"] = cv2.resize(
|
39 |
+
sample["mask"].astype(np.float32),
|
40 |
+
tuple(shape[::-1]),
|
41 |
+
interpolation=cv2.INTER_NEAREST,
|
42 |
+
)
|
43 |
+
sample["mask"] = sample["mask"].astype(bool)
|
44 |
+
|
45 |
+
return tuple(shape)
|
46 |
+
|
47 |
+
|
48 |
+
class Resize(object):
|
49 |
+
"""Resize sample to given size (width, height)."""
|
50 |
+
|
51 |
+
def __init__(
|
52 |
+
self,
|
53 |
+
width,
|
54 |
+
height,
|
55 |
+
resize_target=True,
|
56 |
+
keep_aspect_ratio=False,
|
57 |
+
ensure_multiple_of=1,
|
58 |
+
resize_method="lower_bound",
|
59 |
+
image_interpolation_method=cv2.INTER_AREA,
|
60 |
+
):
|
61 |
+
"""Init.
|
62 |
+
|
63 |
+
Args:
|
64 |
+
width (int): desired output width
|
65 |
+
height (int): desired output height
|
66 |
+
resize_target (bool, optional):
|
67 |
+
True: Resize the full sample (image, mask, target).
|
68 |
+
False: Resize image only.
|
69 |
+
Defaults to True.
|
70 |
+
keep_aspect_ratio (bool, optional):
|
71 |
+
True: Keep the aspect ratio of the input sample.
|
72 |
+
Output sample might not have the given width and height, and
|
73 |
+
resize behaviour depends on the parameter 'resize_method'.
|
74 |
+
Defaults to False.
|
75 |
+
ensure_multiple_of (int, optional):
|
76 |
+
Output width and height is constrained to be multiple of this parameter.
|
77 |
+
Defaults to 1.
|
78 |
+
resize_method (str, optional):
|
79 |
+
"lower_bound": Output will be at least as large as the given size.
|
80 |
+
"upper_bound": Output will be at max as large as the given size. (Output size might be smaller than given size.)
|
81 |
+
"minimal": Scale as least as possible. (Output size might be smaller than given size.)
|
82 |
+
Defaults to "lower_bound".
|
83 |
+
"""
|
84 |
+
self.__width = width
|
85 |
+
self.__height = height
|
86 |
+
|
87 |
+
self.__resize_target = resize_target
|
88 |
+
self.__keep_aspect_ratio = keep_aspect_ratio
|
89 |
+
self.__multiple_of = ensure_multiple_of
|
90 |
+
self.__resize_method = resize_method
|
91 |
+
self.__image_interpolation_method = image_interpolation_method
|
92 |
+
|
93 |
+
def constrain_to_multiple_of(self, x, min_val=0, max_val=None):
|
94 |
+
y = (np.round(x / self.__multiple_of) * self.__multiple_of).astype(int)
|
95 |
+
|
96 |
+
if max_val is not None and y > max_val:
|
97 |
+
y = (np.floor(x / self.__multiple_of) * self.__multiple_of).astype(int)
|
98 |
+
|
99 |
+
if y < min_val:
|
100 |
+
y = (np.ceil(x / self.__multiple_of) * self.__multiple_of).astype(int)
|
101 |
+
|
102 |
+
return y
|
103 |
+
|
104 |
+
def get_size(self, width, height):
|
105 |
+
# determine new height and width
|
106 |
+
scale_height = self.__height / height
|
107 |
+
scale_width = self.__width / width
|
108 |
+
|
109 |
+
if self.__keep_aspect_ratio:
|
110 |
+
if self.__resize_method == "lower_bound":
|
111 |
+
# scale such that output size is lower bound
|
112 |
+
if scale_width > scale_height:
|
113 |
+
# fit width
|
114 |
+
scale_height = scale_width
|
115 |
+
else:
|
116 |
+
# fit height
|
117 |
+
scale_width = scale_height
|
118 |
+
elif self.__resize_method == "upper_bound":
|
119 |
+
# scale such that output size is upper bound
|
120 |
+
if scale_width < scale_height:
|
121 |
+
# fit width
|
122 |
+
scale_height = scale_width
|
123 |
+
else:
|
124 |
+
# fit height
|
125 |
+
scale_width = scale_height
|
126 |
+
elif self.__resize_method == "minimal":
|
127 |
+
# scale as least as possbile
|
128 |
+
if abs(1 - scale_width) < abs(1 - scale_height):
|
129 |
+
# fit width
|
130 |
+
scale_height = scale_width
|
131 |
+
else:
|
132 |
+
# fit height
|
133 |
+
scale_width = scale_height
|
134 |
+
else:
|
135 |
+
raise ValueError(
|
136 |
+
f"resize_method {self.__resize_method} not implemented"
|
137 |
+
)
|
138 |
+
|
139 |
+
if self.__resize_method == "lower_bound":
|
140 |
+
new_height = self.constrain_to_multiple_of(
|
141 |
+
scale_height * height, min_val=self.__height
|
142 |
+
)
|
143 |
+
new_width = self.constrain_to_multiple_of(
|
144 |
+
scale_width * width, min_val=self.__width
|
145 |
+
)
|
146 |
+
elif self.__resize_method == "upper_bound":
|
147 |
+
new_height = self.constrain_to_multiple_of(
|
148 |
+
scale_height * height, max_val=self.__height
|
149 |
+
)
|
150 |
+
new_width = self.constrain_to_multiple_of(
|
151 |
+
scale_width * width, max_val=self.__width
|
152 |
+
)
|
153 |
+
elif self.__resize_method == "minimal":
|
154 |
+
new_height = self.constrain_to_multiple_of(scale_height * height)
|
155 |
+
new_width = self.constrain_to_multiple_of(scale_width * width)
|
156 |
+
else:
|
157 |
+
raise ValueError(f"resize_method {self.__resize_method} not implemented")
|
158 |
+
|
159 |
+
return (new_width, new_height)
|
160 |
+
|
161 |
+
def __call__(self, sample):
|
162 |
+
width, height = self.get_size(
|
163 |
+
sample["image"].shape[1], sample["image"].shape[0]
|
164 |
+
)
|
165 |
+
|
166 |
+
# resize sample
|
167 |
+
sample["image"] = cv2.resize(
|
168 |
+
sample["image"],
|
169 |
+
(width, height),
|
170 |
+
interpolation=self.__image_interpolation_method,
|
171 |
+
)
|
172 |
+
|
173 |
+
if self.__resize_target:
|
174 |
+
if "disparity" in sample:
|
175 |
+
sample["disparity"] = cv2.resize(
|
176 |
+
sample["disparity"],
|
177 |
+
(width, height),
|
178 |
+
interpolation=cv2.INTER_NEAREST,
|
179 |
+
)
|
180 |
+
|
181 |
+
if "depth" in sample:
|
182 |
+
sample["depth"] = cv2.resize(
|
183 |
+
sample["depth"], (width, height), interpolation=cv2.INTER_NEAREST
|
184 |
+
)
|
185 |
+
|
186 |
+
sample["mask"] = cv2.resize(
|
187 |
+
sample["mask"].astype(np.float32),
|
188 |
+
(width, height),
|
189 |
+
interpolation=cv2.INTER_NEAREST,
|
190 |
+
)
|
191 |
+
sample["mask"] = sample["mask"].astype(bool)
|
192 |
+
|
193 |
+
return sample
|
194 |
+
|
195 |
+
|
196 |
+
class NormalizeImage(object):
|
197 |
+
"""Normlize image by given mean and std."""
|
198 |
+
|
199 |
+
def __init__(self, mean, std):
|
200 |
+
self.__mean = mean
|
201 |
+
self.__std = std
|
202 |
+
|
203 |
+
def __call__(self, sample):
|
204 |
+
sample["image"] = (sample["image"] - self.__mean) / self.__std
|
205 |
+
|
206 |
+
return sample
|
207 |
+
|
208 |
+
|
209 |
+
class PrepareForNet(object):
|
210 |
+
"""Prepare sample for usage as network input."""
|
211 |
+
|
212 |
+
def __init__(self):
|
213 |
+
pass
|
214 |
+
|
215 |
+
def __call__(self, sample):
|
216 |
+
image = np.transpose(sample["image"], (2, 0, 1))
|
217 |
+
sample["image"] = np.ascontiguousarray(image).astype(np.float32)
|
218 |
+
|
219 |
+
if "mask" in sample:
|
220 |
+
sample["mask"] = sample["mask"].astype(np.float32)
|
221 |
+
sample["mask"] = np.ascontiguousarray(sample["mask"])
|
222 |
+
|
223 |
+
if "disparity" in sample:
|
224 |
+
disparity = sample["disparity"].astype(np.float32)
|
225 |
+
sample["disparity"] = np.ascontiguousarray(disparity)
|
226 |
+
|
227 |
+
if "depth" in sample:
|
228 |
+
depth = sample["depth"].astype(np.float32)
|
229 |
+
sample["depth"] = np.ascontiguousarray(depth)
|
230 |
+
|
231 |
+
return sample
|
dpt/vit.py
ADDED
@@ -0,0 +1,576 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import torch
|
2 |
+
import torch.nn as nn
|
3 |
+
import timm
|
4 |
+
import types
|
5 |
+
import math
|
6 |
+
import torch.nn.functional as F
|
7 |
+
|
8 |
+
|
9 |
+
activations = {}
|
10 |
+
|
11 |
+
|
12 |
+
def get_activation(name):
|
13 |
+
def hook(model, input, output):
|
14 |
+
activations[name] = output
|
15 |
+
|
16 |
+
return hook
|
17 |
+
|
18 |
+
|
19 |
+
attention = {}
|
20 |
+
|
21 |
+
|
22 |
+
def get_attention(name):
|
23 |
+
def hook(module, input, output):
|
24 |
+
x = input[0]
|
25 |
+
B, N, C = x.shape
|
26 |
+
qkv = (
|
27 |
+
module.qkv(x)
|
28 |
+
.reshape(B, N, 3, module.num_heads, C // module.num_heads)
|
29 |
+
.permute(2, 0, 3, 1, 4)
|
30 |
+
)
|
31 |
+
q, k, v = (
|
32 |
+
qkv[0],
|
33 |
+
qkv[1],
|
34 |
+
qkv[2],
|
35 |
+
) # make torchscript happy (cannot use tensor as tuple)
|
36 |
+
|
37 |
+
attn = (q @ k.transpose(-2, -1)) * module.scale
|
38 |
+
|
39 |
+
attn = attn.softmax(dim=-1) # [:,:,1,1:]
|
40 |
+
attention[name] = attn
|
41 |
+
|
42 |
+
return hook
|
43 |
+
|
44 |
+
|
45 |
+
def get_mean_attention_map(attn, token, shape):
|
46 |
+
attn = attn[:, :, token, 1:]
|
47 |
+
attn = attn.unflatten(2, torch.Size([shape[2] // 16, shape[3] // 16])).float()
|
48 |
+
attn = torch.nn.functional.interpolate(
|
49 |
+
attn, size=shape[2:], mode="bicubic", align_corners=False
|
50 |
+
).squeeze(0)
|
51 |
+
|
52 |
+
all_attn = torch.mean(attn, 0)
|
53 |
+
|
54 |
+
return all_attn
|
55 |
+
|
56 |
+
|
57 |
+
class Slice(nn.Module):
|
58 |
+
def __init__(self, start_index=1):
|
59 |
+
super(Slice, self).__init__()
|
60 |
+
self.start_index = start_index
|
61 |
+
|
62 |
+
def forward(self, x):
|
63 |
+
return x[:, self.start_index :]
|
64 |
+
|
65 |
+
|
66 |
+
class AddReadout(nn.Module):
|
67 |
+
def __init__(self, start_index=1):
|
68 |
+
super(AddReadout, self).__init__()
|
69 |
+
self.start_index = start_index
|
70 |
+
|
71 |
+
def forward(self, x):
|
72 |
+
if self.start_index == 2:
|
73 |
+
readout = (x[:, 0] + x[:, 1]) / 2
|
74 |
+
else:
|
75 |
+
readout = x[:, 0]
|
76 |
+
return x[:, self.start_index :] + readout.unsqueeze(1)
|
77 |
+
|
78 |
+
|
79 |
+
class ProjectReadout(nn.Module):
|
80 |
+
def __init__(self, in_features, start_index=1):
|
81 |
+
super(ProjectReadout, self).__init__()
|
82 |
+
self.start_index = start_index
|
83 |
+
|
84 |
+
self.project = nn.Sequential(nn.Linear(2 * in_features, in_features), nn.GELU())
|
85 |
+
|
86 |
+
def forward(self, x):
|
87 |
+
readout = x[:, 0].unsqueeze(1).expand_as(x[:, self.start_index :])
|
88 |
+
features = torch.cat((x[:, self.start_index :], readout), -1)
|
89 |
+
|
90 |
+
return self.project(features)
|
91 |
+
|
92 |
+
|
93 |
+
class Transpose(nn.Module):
|
94 |
+
def __init__(self, dim0, dim1):
|
95 |
+
super(Transpose, self).__init__()
|
96 |
+
self.dim0 = dim0
|
97 |
+
self.dim1 = dim1
|
98 |
+
|
99 |
+
def forward(self, x):
|
100 |
+
x = x.transpose(self.dim0, self.dim1)
|
101 |
+
return x
|
102 |
+
|
103 |
+
|
104 |
+
def forward_vit(pretrained, x):
|
105 |
+
b, c, h, w = x.shape
|
106 |
+
|
107 |
+
glob = pretrained.model.forward_flex(x)
|
108 |
+
|
109 |
+
layer_1 = pretrained.activations["1"]
|
110 |
+
layer_2 = pretrained.activations["2"]
|
111 |
+
layer_3 = pretrained.activations["3"]
|
112 |
+
layer_4 = pretrained.activations["4"]
|
113 |
+
|
114 |
+
layer_1 = pretrained.act_postprocess1[0:2](layer_1)
|
115 |
+
layer_2 = pretrained.act_postprocess2[0:2](layer_2)
|
116 |
+
layer_3 = pretrained.act_postprocess3[0:2](layer_3)
|
117 |
+
layer_4 = pretrained.act_postprocess4[0:2](layer_4)
|
118 |
+
|
119 |
+
unflatten = nn.Sequential(
|
120 |
+
nn.Unflatten(
|
121 |
+
2,
|
122 |
+
torch.Size(
|
123 |
+
[
|
124 |
+
h // pretrained.model.patch_size[1],
|
125 |
+
w // pretrained.model.patch_size[0],
|
126 |
+
]
|
127 |
+
),
|
128 |
+
)
|
129 |
+
)
|
130 |
+
|
131 |
+
if layer_1.ndim == 3:
|
132 |
+
layer_1 = unflatten(layer_1)
|
133 |
+
if layer_2.ndim == 3:
|
134 |
+
layer_2 = unflatten(layer_2)
|
135 |
+
if layer_3.ndim == 3:
|
136 |
+
layer_3 = unflatten(layer_3)
|
137 |
+
if layer_4.ndim == 3:
|
138 |
+
layer_4 = unflatten(layer_4)
|
139 |
+
|
140 |
+
layer_1 = pretrained.act_postprocess1[3 : len(pretrained.act_postprocess1)](layer_1)
|
141 |
+
layer_2 = pretrained.act_postprocess2[3 : len(pretrained.act_postprocess2)](layer_2)
|
142 |
+
layer_3 = pretrained.act_postprocess3[3 : len(pretrained.act_postprocess3)](layer_3)
|
143 |
+
layer_4 = pretrained.act_postprocess4[3 : len(pretrained.act_postprocess4)](layer_4)
|
144 |
+
|
145 |
+
return layer_1, layer_2, layer_3, layer_4
|
146 |
+
|
147 |
+
|
148 |
+
def _resize_pos_embed(self, posemb, gs_h, gs_w):
|
149 |
+
posemb_tok, posemb_grid = (
|
150 |
+
posemb[:, : self.start_index],
|
151 |
+
posemb[0, self.start_index :],
|
152 |
+
)
|
153 |
+
|
154 |
+
gs_old = int(math.sqrt(len(posemb_grid)))
|
155 |
+
|
156 |
+
posemb_grid = posemb_grid.reshape(1, gs_old, gs_old, -1).permute(0, 3, 1, 2)
|
157 |
+
posemb_grid = F.interpolate(posemb_grid, size=(gs_h, gs_w), mode="bilinear")
|
158 |
+
posemb_grid = posemb_grid.permute(0, 2, 3, 1).reshape(1, gs_h * gs_w, -1)
|
159 |
+
|
160 |
+
posemb = torch.cat([posemb_tok, posemb_grid], dim=1)
|
161 |
+
|
162 |
+
return posemb
|
163 |
+
|
164 |
+
|
165 |
+
def forward_flex(self, x):
|
166 |
+
b, c, h, w = x.shape
|
167 |
+
|
168 |
+
pos_embed = self._resize_pos_embed(
|
169 |
+
self.pos_embed, h // self.patch_size[1], w // self.patch_size[0]
|
170 |
+
)
|
171 |
+
|
172 |
+
B = x.shape[0]
|
173 |
+
|
174 |
+
if hasattr(self.patch_embed, "backbone"):
|
175 |
+
x = self.patch_embed.backbone(x)
|
176 |
+
if isinstance(x, (list, tuple)):
|
177 |
+
x = x[-1] # last feature if backbone outputs list/tuple of features
|
178 |
+
|
179 |
+
x = self.patch_embed.proj(x).flatten(2).transpose(1, 2)
|
180 |
+
|
181 |
+
if getattr(self, "dist_token", None) is not None:
|
182 |
+
cls_tokens = self.cls_token.expand(
|
183 |
+
B, -1, -1
|
184 |
+
) # stole cls_tokens impl from Phil Wang, thanks
|
185 |
+
dist_token = self.dist_token.expand(B, -1, -1)
|
186 |
+
x = torch.cat((cls_tokens, dist_token, x), dim=1)
|
187 |
+
else:
|
188 |
+
cls_tokens = self.cls_token.expand(
|
189 |
+
B, -1, -1
|
190 |
+
) # stole cls_tokens impl from Phil Wang, thanks
|
191 |
+
x = torch.cat((cls_tokens, x), dim=1)
|
192 |
+
|
193 |
+
x = x + pos_embed
|
194 |
+
x = self.pos_drop(x)
|
195 |
+
|
196 |
+
for blk in self.blocks:
|
197 |
+
x = blk(x)
|
198 |
+
|
199 |
+
x = self.norm(x)
|
200 |
+
|
201 |
+
return x
|
202 |
+
|
203 |
+
|
204 |
+
def get_readout_oper(vit_features, features, use_readout, start_index=1):
|
205 |
+
if use_readout == "ignore":
|
206 |
+
readout_oper = [Slice(start_index)] * len(features)
|
207 |
+
elif use_readout == "add":
|
208 |
+
readout_oper = [AddReadout(start_index)] * len(features)
|
209 |
+
elif use_readout == "project":
|
210 |
+
readout_oper = [
|
211 |
+
ProjectReadout(vit_features, start_index) for out_feat in features
|
212 |
+
]
|
213 |
+
else:
|
214 |
+
assert (
|
215 |
+
False
|
216 |
+
), "wrong operation for readout token, use_readout can be 'ignore', 'add', or 'project'"
|
217 |
+
|
218 |
+
return readout_oper
|
219 |
+
|
220 |
+
|
221 |
+
def _make_vit_b16_backbone(
|
222 |
+
model,
|
223 |
+
features=[96, 192, 384, 768],
|
224 |
+
size=[384, 384],
|
225 |
+
hooks=[2, 5, 8, 11],
|
226 |
+
vit_features=768,
|
227 |
+
use_readout="ignore",
|
228 |
+
start_index=1,
|
229 |
+
enable_attention_hooks=False,
|
230 |
+
):
|
231 |
+
pretrained = nn.Module()
|
232 |
+
|
233 |
+
pretrained.model = model
|
234 |
+
pretrained.model.blocks[hooks[0]].register_forward_hook(get_activation("1"))
|
235 |
+
pretrained.model.blocks[hooks[1]].register_forward_hook(get_activation("2"))
|
236 |
+
pretrained.model.blocks[hooks[2]].register_forward_hook(get_activation("3"))
|
237 |
+
pretrained.model.blocks[hooks[3]].register_forward_hook(get_activation("4"))
|
238 |
+
|
239 |
+
pretrained.activations = activations
|
240 |
+
|
241 |
+
if enable_attention_hooks:
|
242 |
+
pretrained.model.blocks[hooks[0]].attn.register_forward_hook(
|
243 |
+
get_attention("attn_1")
|
244 |
+
)
|
245 |
+
pretrained.model.blocks[hooks[1]].attn.register_forward_hook(
|
246 |
+
get_attention("attn_2")
|
247 |
+
)
|
248 |
+
pretrained.model.blocks[hooks[2]].attn.register_forward_hook(
|
249 |
+
get_attention("attn_3")
|
250 |
+
)
|
251 |
+
pretrained.model.blocks[hooks[3]].attn.register_forward_hook(
|
252 |
+
get_attention("attn_4")
|
253 |
+
)
|
254 |
+
pretrained.attention = attention
|
255 |
+
|
256 |
+
readout_oper = get_readout_oper(vit_features, features, use_readout, start_index)
|
257 |
+
|
258 |
+
# 32, 48, 136, 384
|
259 |
+
pretrained.act_postprocess1 = nn.Sequential(
|
260 |
+
readout_oper[0],
|
261 |
+
Transpose(1, 2),
|
262 |
+
nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])),
|
263 |
+
nn.Conv2d(
|
264 |
+
in_channels=vit_features,
|
265 |
+
out_channels=features[0],
|
266 |
+
kernel_size=1,
|
267 |
+
stride=1,
|
268 |
+
padding=0,
|
269 |
+
),
|
270 |
+
nn.ConvTranspose2d(
|
271 |
+
in_channels=features[0],
|
272 |
+
out_channels=features[0],
|
273 |
+
kernel_size=4,
|
274 |
+
stride=4,
|
275 |
+
padding=0,
|
276 |
+
bias=True,
|
277 |
+
dilation=1,
|
278 |
+
groups=1,
|
279 |
+
),
|
280 |
+
)
|
281 |
+
|
282 |
+
pretrained.act_postprocess2 = nn.Sequential(
|
283 |
+
readout_oper[1],
|
284 |
+
Transpose(1, 2),
|
285 |
+
nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])),
|
286 |
+
nn.Conv2d(
|
287 |
+
in_channels=vit_features,
|
288 |
+
out_channels=features[1],
|
289 |
+
kernel_size=1,
|
290 |
+
stride=1,
|
291 |
+
padding=0,
|
292 |
+
),
|
293 |
+
nn.ConvTranspose2d(
|
294 |
+
in_channels=features[1],
|
295 |
+
out_channels=features[1],
|
296 |
+
kernel_size=2,
|
297 |
+
stride=2,
|
298 |
+
padding=0,
|
299 |
+
bias=True,
|
300 |
+
dilation=1,
|
301 |
+
groups=1,
|
302 |
+
),
|
303 |
+
)
|
304 |
+
|
305 |
+
pretrained.act_postprocess3 = nn.Sequential(
|
306 |
+
readout_oper[2],
|
307 |
+
Transpose(1, 2),
|
308 |
+
nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])),
|
309 |
+
nn.Conv2d(
|
310 |
+
in_channels=vit_features,
|
311 |
+
out_channels=features[2],
|
312 |
+
kernel_size=1,
|
313 |
+
stride=1,
|
314 |
+
padding=0,
|
315 |
+
),
|
316 |
+
)
|
317 |
+
|
318 |
+
pretrained.act_postprocess4 = nn.Sequential(
|
319 |
+
readout_oper[3],
|
320 |
+
Transpose(1, 2),
|
321 |
+
nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])),
|
322 |
+
nn.Conv2d(
|
323 |
+
in_channels=vit_features,
|
324 |
+
out_channels=features[3],
|
325 |
+
kernel_size=1,
|
326 |
+
stride=1,
|
327 |
+
padding=0,
|
328 |
+
),
|
329 |
+
nn.Conv2d(
|
330 |
+
in_channels=features[3],
|
331 |
+
out_channels=features[3],
|
332 |
+
kernel_size=3,
|
333 |
+
stride=2,
|
334 |
+
padding=1,
|
335 |
+
),
|
336 |
+
)
|
337 |
+
|
338 |
+
pretrained.model.start_index = start_index
|
339 |
+
pretrained.model.patch_size = [16, 16]
|
340 |
+
|
341 |
+
# We inject this function into the VisionTransformer instances so that
|
342 |
+
# we can use it with interpolated position embeddings without modifying the library source.
|
343 |
+
pretrained.model.forward_flex = types.MethodType(forward_flex, pretrained.model)
|
344 |
+
pretrained.model._resize_pos_embed = types.MethodType(
|
345 |
+
_resize_pos_embed, pretrained.model
|
346 |
+
)
|
347 |
+
|
348 |
+
return pretrained
|
349 |
+
|
350 |
+
|
351 |
+
def _make_vit_b_rn50_backbone(
|
352 |
+
model,
|
353 |
+
features=[256, 512, 768, 768],
|
354 |
+
size=[384, 384],
|
355 |
+
hooks=[0, 1, 8, 11],
|
356 |
+
vit_features=768,
|
357 |
+
use_vit_only=False,
|
358 |
+
use_readout="ignore",
|
359 |
+
start_index=1,
|
360 |
+
enable_attention_hooks=False,
|
361 |
+
):
|
362 |
+
pretrained = nn.Module()
|
363 |
+
|
364 |
+
pretrained.model = model
|
365 |
+
|
366 |
+
if use_vit_only == True:
|
367 |
+
pretrained.model.blocks[hooks[0]].register_forward_hook(get_activation("1"))
|
368 |
+
pretrained.model.blocks[hooks[1]].register_forward_hook(get_activation("2"))
|
369 |
+
else:
|
370 |
+
pretrained.model.patch_embed.backbone.stages[0].register_forward_hook(
|
371 |
+
get_activation("1")
|
372 |
+
)
|
373 |
+
pretrained.model.patch_embed.backbone.stages[1].register_forward_hook(
|
374 |
+
get_activation("2")
|
375 |
+
)
|
376 |
+
|
377 |
+
pretrained.model.blocks[hooks[2]].register_forward_hook(get_activation("3"))
|
378 |
+
pretrained.model.blocks[hooks[3]].register_forward_hook(get_activation("4"))
|
379 |
+
|
380 |
+
if enable_attention_hooks:
|
381 |
+
pretrained.model.blocks[2].attn.register_forward_hook(get_attention("attn_1"))
|
382 |
+
pretrained.model.blocks[5].attn.register_forward_hook(get_attention("attn_2"))
|
383 |
+
pretrained.model.blocks[8].attn.register_forward_hook(get_attention("attn_3"))
|
384 |
+
pretrained.model.blocks[11].attn.register_forward_hook(get_attention("attn_4"))
|
385 |
+
pretrained.attention = attention
|
386 |
+
|
387 |
+
pretrained.activations = activations
|
388 |
+
|
389 |
+
readout_oper = get_readout_oper(vit_features, features, use_readout, start_index)
|
390 |
+
|
391 |
+
if use_vit_only == True:
|
392 |
+
pretrained.act_postprocess1 = nn.Sequential(
|
393 |
+
readout_oper[0],
|
394 |
+
Transpose(1, 2),
|
395 |
+
nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])),
|
396 |
+
nn.Conv2d(
|
397 |
+
in_channels=vit_features,
|
398 |
+
out_channels=features[0],
|
399 |
+
kernel_size=1,
|
400 |
+
stride=1,
|
401 |
+
padding=0,
|
402 |
+
),
|
403 |
+
nn.ConvTranspose2d(
|
404 |
+
in_channels=features[0],
|
405 |
+
out_channels=features[0],
|
406 |
+
kernel_size=4,
|
407 |
+
stride=4,
|
408 |
+
padding=0,
|
409 |
+
bias=True,
|
410 |
+
dilation=1,
|
411 |
+
groups=1,
|
412 |
+
),
|
413 |
+
)
|
414 |
+
|
415 |
+
pretrained.act_postprocess2 = nn.Sequential(
|
416 |
+
readout_oper[1],
|
417 |
+
Transpose(1, 2),
|
418 |
+
nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])),
|
419 |
+
nn.Conv2d(
|
420 |
+
in_channels=vit_features,
|
421 |
+
out_channels=features[1],
|
422 |
+
kernel_size=1,
|
423 |
+
stride=1,
|
424 |
+
padding=0,
|
425 |
+
),
|
426 |
+
nn.ConvTranspose2d(
|
427 |
+
in_channels=features[1],
|
428 |
+
out_channels=features[1],
|
429 |
+
kernel_size=2,
|
430 |
+
stride=2,
|
431 |
+
padding=0,
|
432 |
+
bias=True,
|
433 |
+
dilation=1,
|
434 |
+
groups=1,
|
435 |
+
),
|
436 |
+
)
|
437 |
+
else:
|
438 |
+
pretrained.act_postprocess1 = nn.Sequential(
|
439 |
+
nn.Identity(), nn.Identity(), nn.Identity()
|
440 |
+
)
|
441 |
+
pretrained.act_postprocess2 = nn.Sequential(
|
442 |
+
nn.Identity(), nn.Identity(), nn.Identity()
|
443 |
+
)
|
444 |
+
|
445 |
+
pretrained.act_postprocess3 = nn.Sequential(
|
446 |
+
readout_oper[2],
|
447 |
+
Transpose(1, 2),
|
448 |
+
nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])),
|
449 |
+
nn.Conv2d(
|
450 |
+
in_channels=vit_features,
|
451 |
+
out_channels=features[2],
|
452 |
+
kernel_size=1,
|
453 |
+
stride=1,
|
454 |
+
padding=0,
|
455 |
+
),
|
456 |
+
)
|
457 |
+
|
458 |
+
pretrained.act_postprocess4 = nn.Sequential(
|
459 |
+
readout_oper[3],
|
460 |
+
Transpose(1, 2),
|
461 |
+
nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])),
|
462 |
+
nn.Conv2d(
|
463 |
+
in_channels=vit_features,
|
464 |
+
out_channels=features[3],
|
465 |
+
kernel_size=1,
|
466 |
+
stride=1,
|
467 |
+
padding=0,
|
468 |
+
),
|
469 |
+
nn.Conv2d(
|
470 |
+
in_channels=features[3],
|
471 |
+
out_channels=features[3],
|
472 |
+
kernel_size=3,
|
473 |
+
stride=2,
|
474 |
+
padding=1,
|
475 |
+
),
|
476 |
+
)
|
477 |
+
|
478 |
+
pretrained.model.start_index = start_index
|
479 |
+
pretrained.model.patch_size = [16, 16]
|
480 |
+
|
481 |
+
# We inject this function into the VisionTransformer instances so that
|
482 |
+
# we can use it with interpolated position embeddings without modifying the library source.
|
483 |
+
pretrained.model.forward_flex = types.MethodType(forward_flex, pretrained.model)
|
484 |
+
|
485 |
+
# We inject this function into the VisionTransformer instances so that
|
486 |
+
# we can use it with interpolated position embeddings without modifying the library source.
|
487 |
+
pretrained.model._resize_pos_embed = types.MethodType(
|
488 |
+
_resize_pos_embed, pretrained.model
|
489 |
+
)
|
490 |
+
|
491 |
+
return pretrained
|
492 |
+
|
493 |
+
|
494 |
+
def _make_pretrained_vitb_rn50_384(
|
495 |
+
pretrained,
|
496 |
+
use_readout="ignore",
|
497 |
+
hooks=None,
|
498 |
+
use_vit_only=False,
|
499 |
+
enable_attention_hooks=False,
|
500 |
+
):
|
501 |
+
model = timm.create_model("vit_base_resnet50_384", pretrained=pretrained)
|
502 |
+
|
503 |
+
hooks = [0, 1, 8, 11] if hooks == None else hooks
|
504 |
+
return _make_vit_b_rn50_backbone(
|
505 |
+
model,
|
506 |
+
features=[256, 512, 768, 768],
|
507 |
+
size=[384, 384],
|
508 |
+
hooks=hooks,
|
509 |
+
use_vit_only=use_vit_only,
|
510 |
+
use_readout=use_readout,
|
511 |
+
enable_attention_hooks=enable_attention_hooks,
|
512 |
+
)
|
513 |
+
|
514 |
+
|
515 |
+
def _make_pretrained_vitl16_384(
|
516 |
+
pretrained, use_readout="ignore", hooks=None, enable_attention_hooks=False
|
517 |
+
):
|
518 |
+
model = timm.create_model("vit_large_patch16_384", pretrained=pretrained)
|
519 |
+
|
520 |
+
hooks = [5, 11, 17, 23] if hooks == None else hooks
|
521 |
+
return _make_vit_b16_backbone(
|
522 |
+
model,
|
523 |
+
features=[256, 512, 1024, 1024],
|
524 |
+
hooks=hooks,
|
525 |
+
vit_features=1024,
|
526 |
+
use_readout=use_readout,
|
527 |
+
enable_attention_hooks=enable_attention_hooks,
|
528 |
+
)
|
529 |
+
|
530 |
+
|
531 |
+
def _make_pretrained_vitb16_384(
|
532 |
+
pretrained, use_readout="ignore", hooks=None, enable_attention_hooks=False
|
533 |
+
):
|
534 |
+
model = timm.create_model("vit_base_patch16_384", pretrained=pretrained)
|
535 |
+
|
536 |
+
hooks = [2, 5, 8, 11] if hooks == None else hooks
|
537 |
+
return _make_vit_b16_backbone(
|
538 |
+
model,
|
539 |
+
features=[96, 192, 384, 768],
|
540 |
+
hooks=hooks,
|
541 |
+
use_readout=use_readout,
|
542 |
+
enable_attention_hooks=enable_attention_hooks,
|
543 |
+
)
|
544 |
+
|
545 |
+
|
546 |
+
def _make_pretrained_deitb16_384(
|
547 |
+
pretrained, use_readout="ignore", hooks=None, enable_attention_hooks=False
|
548 |
+
):
|
549 |
+
model = timm.create_model("vit_deit_base_patch16_384", pretrained=pretrained)
|
550 |
+
|
551 |
+
hooks = [2, 5, 8, 11] if hooks == None else hooks
|
552 |
+
return _make_vit_b16_backbone(
|
553 |
+
model,
|
554 |
+
features=[96, 192, 384, 768],
|
555 |
+
hooks=hooks,
|
556 |
+
use_readout=use_readout,
|
557 |
+
enable_attention_hooks=enable_attention_hooks,
|
558 |
+
)
|
559 |
+
|
560 |
+
|
561 |
+
def _make_pretrained_deitb16_distil_384(
|
562 |
+
pretrained, use_readout="ignore", hooks=None, enable_attention_hooks=False
|
563 |
+
):
|
564 |
+
model = timm.create_model(
|
565 |
+
"vit_deit_base_distilled_patch16_384", pretrained=pretrained
|
566 |
+
)
|
567 |
+
|
568 |
+
hooks = [2, 5, 8, 11] if hooks == None else hooks
|
569 |
+
return _make_vit_b16_backbone(
|
570 |
+
model,
|
571 |
+
features=[96, 192, 384, 768],
|
572 |
+
hooks=hooks,
|
573 |
+
use_readout=use_readout,
|
574 |
+
start_index=2,
|
575 |
+
enable_attention_hooks=enable_attention_hooks,
|
576 |
+
)
|
mask2former/__init__.py
ADDED
@@ -0,0 +1,26 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# Copyright (c) Facebook, Inc. and its affiliates.
|
2 |
+
from . import data # register all new datasets
|
3 |
+
from . import modeling
|
4 |
+
|
5 |
+
# config
|
6 |
+
from .config import add_maskformer2_config
|
7 |
+
|
8 |
+
# dataset loading
|
9 |
+
from .data.dataset_mappers.coco_instance_new_baseline_dataset_mapper import COCOInstanceNewBaselineDatasetMapper
|
10 |
+
from .data.dataset_mappers.coco_panoptic_new_baseline_dataset_mapper import COCOPanopticNewBaselineDatasetMapper
|
11 |
+
from .data.dataset_mappers.mask_former_instance_dataset_mapper import (
|
12 |
+
MaskFormerInstanceDatasetMapper,
|
13 |
+
)
|
14 |
+
from .data.dataset_mappers.mask_former_panoptic_dataset_mapper import (
|
15 |
+
MaskFormerPanopticDatasetMapper,
|
16 |
+
)
|
17 |
+
from .data.dataset_mappers.mask_former_semantic_dataset_mapper import (
|
18 |
+
MaskFormerSemanticDatasetMapper,
|
19 |
+
)
|
20 |
+
|
21 |
+
# models
|
22 |
+
from .maskformer_model import MaskFormer
|
23 |
+
from .test_time_augmentation import SemanticSegmentorWithTTA
|
24 |
+
|
25 |
+
# evaluation
|
26 |
+
from .evaluation.instance_evaluation import InstanceSegEvaluator
|
mask2former/config.py
ADDED
@@ -0,0 +1,114 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# -*- coding: utf-8 -*-
|
2 |
+
# Copyright (c) Facebook, Inc. and its affiliates.
|
3 |
+
from detectron2.config import CfgNode as CN
|
4 |
+
|
5 |
+
|
6 |
+
def add_maskformer2_config(cfg):
|
7 |
+
"""
|
8 |
+
Add config for MASK_FORMER.
|
9 |
+
"""
|
10 |
+
# NOTE: configs from original maskformer
|
11 |
+
# data config
|
12 |
+
# select the dataset mapper
|
13 |
+
cfg.INPUT.DATASET_MAPPER_NAME = "mask_former_semantic"
|
14 |
+
# Color augmentation
|
15 |
+
cfg.INPUT.COLOR_AUG_SSD = False
|
16 |
+
# We retry random cropping until no single category in semantic segmentation GT occupies more
|
17 |
+
# than `SINGLE_CATEGORY_MAX_AREA` part of the crop.
|
18 |
+
cfg.INPUT.CROP.SINGLE_CATEGORY_MAX_AREA = 1.0
|
19 |
+
# Pad image and segmentation GT in dataset mapper.
|
20 |
+
cfg.INPUT.SIZE_DIVISIBILITY = -1
|
21 |
+
|
22 |
+
# solver config
|
23 |
+
# weight decay on embedding
|
24 |
+
cfg.SOLVER.WEIGHT_DECAY_EMBED = 0.0
|
25 |
+
# optimizer
|
26 |
+
cfg.SOLVER.OPTIMIZER = "ADAMW"
|
27 |
+
cfg.SOLVER.BACKBONE_MULTIPLIER = 0.1
|
28 |
+
|
29 |
+
# mask_former model config
|
30 |
+
cfg.MODEL.MASK_FORMER = CN()
|
31 |
+
|
32 |
+
# loss
|
33 |
+
cfg.MODEL.MASK_FORMER.DEEP_SUPERVISION = True
|
34 |
+
cfg.MODEL.MASK_FORMER.NO_OBJECT_WEIGHT = 0.1
|
35 |
+
cfg.MODEL.MASK_FORMER.CLASS_WEIGHT = 1.0
|
36 |
+
cfg.MODEL.MASK_FORMER.DICE_WEIGHT = 1.0
|
37 |
+
cfg.MODEL.MASK_FORMER.MASK_WEIGHT = 20.0
|
38 |
+
|
39 |
+
# transformer config
|
40 |
+
cfg.MODEL.MASK_FORMER.NHEADS = 8
|
41 |
+
cfg.MODEL.MASK_FORMER.DROPOUT = 0.1
|
42 |
+
cfg.MODEL.MASK_FORMER.DIM_FEEDFORWARD = 2048
|
43 |
+
cfg.MODEL.MASK_FORMER.ENC_LAYERS = 0
|
44 |
+
cfg.MODEL.MASK_FORMER.DEC_LAYERS = 6
|
45 |
+
cfg.MODEL.MASK_FORMER.PRE_NORM = False
|
46 |
+
|
47 |
+
cfg.MODEL.MASK_FORMER.HIDDEN_DIM = 256
|
48 |
+
cfg.MODEL.MASK_FORMER.NUM_OBJECT_QUERIES = 100
|
49 |
+
|
50 |
+
cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE = "res5"
|
51 |
+
cfg.MODEL.MASK_FORMER.ENFORCE_INPUT_PROJ = False
|
52 |
+
|
53 |
+
# mask_former inference config
|
54 |
+
cfg.MODEL.MASK_FORMER.TEST = CN()
|
55 |
+
cfg.MODEL.MASK_FORMER.TEST.SEMANTIC_ON = True
|
56 |
+
cfg.MODEL.MASK_FORMER.TEST.INSTANCE_ON = False
|
57 |
+
cfg.MODEL.MASK_FORMER.TEST.PANOPTIC_ON = False
|
58 |
+
cfg.MODEL.MASK_FORMER.TEST.OBJECT_MASK_THRESHOLD = 0.0
|
59 |
+
cfg.MODEL.MASK_FORMER.TEST.OVERLAP_THRESHOLD = 0.0
|
60 |
+
cfg.MODEL.MASK_FORMER.TEST.SEM_SEG_POSTPROCESSING_BEFORE_INFERENCE = False
|
61 |
+
|
62 |
+
# Sometimes `backbone.size_divisibility` is set to 0 for some backbone (e.g. ResNet)
|
63 |
+
# you can use this config to override
|
64 |
+
cfg.MODEL.MASK_FORMER.SIZE_DIVISIBILITY = 32
|
65 |
+
|
66 |
+
# pixel decoder config
|
67 |
+
cfg.MODEL.SEM_SEG_HEAD.MASK_DIM = 256
|
68 |
+
# adding transformer in pixel decoder
|
69 |
+
cfg.MODEL.SEM_SEG_HEAD.TRANSFORMER_ENC_LAYERS = 0
|
70 |
+
# pixel decoder
|
71 |
+
cfg.MODEL.SEM_SEG_HEAD.PIXEL_DECODER_NAME = "BasePixelDecoder"
|
72 |
+
|
73 |
+
# swin transformer backbone
|
74 |
+
cfg.MODEL.SWIN = CN()
|
75 |
+
cfg.MODEL.SWIN.PRETRAIN_IMG_SIZE = 224
|
76 |
+
cfg.MODEL.SWIN.PATCH_SIZE = 4
|
77 |
+
cfg.MODEL.SWIN.EMBED_DIM = 96
|
78 |
+
cfg.MODEL.SWIN.DEPTHS = [2, 2, 6, 2]
|
79 |
+
cfg.MODEL.SWIN.NUM_HEADS = [3, 6, 12, 24]
|
80 |
+
cfg.MODEL.SWIN.WINDOW_SIZE = 7
|
81 |
+
cfg.MODEL.SWIN.MLP_RATIO = 4.0
|
82 |
+
cfg.MODEL.SWIN.QKV_BIAS = True
|
83 |
+
cfg.MODEL.SWIN.QK_SCALE = None
|
84 |
+
cfg.MODEL.SWIN.DROP_RATE = 0.0
|
85 |
+
cfg.MODEL.SWIN.ATTN_DROP_RATE = 0.0
|
86 |
+
cfg.MODEL.SWIN.DROP_PATH_RATE = 0.3
|
87 |
+
cfg.MODEL.SWIN.APE = False
|
88 |
+
cfg.MODEL.SWIN.PATCH_NORM = True
|
89 |
+
cfg.MODEL.SWIN.OUT_FEATURES = ["res2", "res3", "res4", "res5"]
|
90 |
+
cfg.MODEL.SWIN.USE_CHECKPOINT = False
|
91 |
+
|
92 |
+
# NOTE: maskformer2 extra configs
|
93 |
+
# transformer module
|
94 |
+
cfg.MODEL.MASK_FORMER.TRANSFORMER_DECODER_NAME = "MultiScaleMaskedTransformerDecoder"
|
95 |
+
|
96 |
+
# LSJ aug
|
97 |
+
cfg.INPUT.IMAGE_SIZE = 1024
|
98 |
+
cfg.INPUT.MIN_SCALE = 0.1
|
99 |
+
cfg.INPUT.MAX_SCALE = 2.0
|
100 |
+
|
101 |
+
# MSDeformAttn encoder configs
|
102 |
+
cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES = ["res3", "res4", "res5"]
|
103 |
+
cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_N_POINTS = 4
|
104 |
+
cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_N_HEADS = 8
|
105 |
+
|
106 |
+
# point loss configs
|
107 |
+
# Number of points sampled during training for a mask point head.
|
108 |
+
cfg.MODEL.MASK_FORMER.TRAIN_NUM_POINTS = 112 * 112
|
109 |
+
# Oversampling parameter for PointRend point sampling during training. Parameter `k` in the
|
110 |
+
# original paper.
|
111 |
+
cfg.MODEL.MASK_FORMER.OVERSAMPLE_RATIO = 3.0
|
112 |
+
# Importance sampling parameter for PointRend point sampling during training. Parametr `beta` in
|
113 |
+
# the original paper.
|
114 |
+
cfg.MODEL.MASK_FORMER.IMPORTANCE_SAMPLE_RATIO = 0.75
|
mask2former/configs/ade20k/instance-segmentation/Base-ADE20K-InstanceSegmentation.yaml
ADDED
@@ -0,0 +1,61 @@
MODEL:
  BACKBONE:
    FREEZE_AT: 0
    NAME: "build_resnet_backbone"
  WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  RESNETS:
    DEPTH: 50
    STEM_TYPE: "basic" # not used
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
    # NORM: "SyncBN"
    RES5_MULTI_GRID: [1, 1, 1] # not used
DATASETS:
  TRAIN: ("ade20k_instance_train",)
  TEST: ("ade20k_instance_val",)
SOLVER:
  IMS_PER_BATCH: 16
  BASE_LR: 0.0001
  MAX_ITER: 160000
  WARMUP_FACTOR: 1.0
  WARMUP_ITERS: 0
  WEIGHT_DECAY: 0.05
  OPTIMIZER: "ADAMW"
  LR_SCHEDULER_NAME: "WarmupPolyLR"
  BACKBONE_MULTIPLIER: 0.1
  CLIP_GRADIENTS:
    ENABLED: True
    CLIP_TYPE: "full_model"
    CLIP_VALUE: 0.01
    NORM_TYPE: 2.0
  AMP:
    ENABLED: True
INPUT:
  MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 640) for x in range(5, 21)]"]
  MIN_SIZE_TRAIN_SAMPLING: "choice"
  MIN_SIZE_TEST: 640
  MAX_SIZE_TRAIN: 2560
  MAX_SIZE_TEST: 2560
  CROP:
    ENABLED: True
    TYPE: "absolute"
    SIZE: (640, 640)
    SINGLE_CATEGORY_MAX_AREA: 1.0
  COLOR_AUG_SSD: True
  SIZE_DIVISIBILITY: 640 # used in dataset mapper
  FORMAT: "RGB"
  DATASET_MAPPER_NAME: "mask_former_instance"
TEST:
  EVAL_PERIOD: 5000
  AUG:
    ENABLED: False
    MIN_SIZES: [320, 480, 640, 800, 960, 1120]
    MAX_SIZE: 4480
    FLIP: True
DATALOADER:
  FILTER_EMPTY_ANNOTATIONS: True
  NUM_WORKERS: 4
VERSION: 2
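
The !!python/object/apply:eval entry above is just a compact way to generate the list of candidate short-side sizes for multi-scale training; with MIN_SIZE_TRAIN_SAMPLING: "choice", one of them is picked per image. A quick check of what the expression evaluates to (plain Python, nothing repo-specific):

# The MIN_SIZE_TRAIN expression from the config above.
sizes = [int(x * 0.1 * 640) for x in range(5, 21)]
print(len(sizes), sizes[0], sizes[-1])  # 16 candidate short sides, from 320 up to 1280
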
mask2former/configs/ade20k/instance-segmentation/maskformer2_R50_bs16_160k.yaml
ADDED
@@ -0,0 +1,44 @@
_BASE_: Base-ADE20K-InstanceSegmentation.yaml
MODEL:
  META_ARCHITECTURE: "MaskFormer"
  SEM_SEG_HEAD:
    NAME: "MaskFormerHead"
    IGNORE_VALUE: 255
    NUM_CLASSES: 100
    LOSS_WEIGHT: 1.0
    CONVS_DIM: 256
    MASK_DIM: 256
    NORM: "GN"
    # pixel decoder
    PIXEL_DECODER_NAME: "MSDeformAttnPixelDecoder"
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
    COMMON_STRIDE: 4
    TRANSFORMER_ENC_LAYERS: 6
  MASK_FORMER:
    TRANSFORMER_DECODER_NAME: "MultiScaleMaskedTransformerDecoder"
    TRANSFORMER_IN_FEATURE: "multi_scale_pixel_decoder"
    DEEP_SUPERVISION: True
    NO_OBJECT_WEIGHT: 0.1
    CLASS_WEIGHT: 2.0
    MASK_WEIGHT: 5.0
    DICE_WEIGHT: 5.0
    HIDDEN_DIM: 256
    NUM_OBJECT_QUERIES: 100
    NHEADS: 8
    DROPOUT: 0.0
    DIM_FEEDFORWARD: 2048
    ENC_LAYERS: 0
    PRE_NORM: False
    ENFORCE_INPUT_PROJ: False
    SIZE_DIVISIBILITY: 32
    DEC_LAYERS: 10 # 9 decoder layers, add one for the loss on learnable query
    TRAIN_NUM_POINTS: 12544
    OVERSAMPLE_RATIO: 3.0
    IMPORTANCE_SAMPLE_RATIO: 0.75
    TEST:
      SEMANTIC_ON: True
      INSTANCE_ON: True
      PANOPTIC_ON: True
      OVERLAP_THRESHOLD: 0.8
      OBJECT_MASK_THRESHOLD: 0.8
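
The CLASS_WEIGHT / MASK_WEIGHT / DICE_WEIGHT / NO_OBJECT_WEIGHT entries above weight the terms of the set-prediction criterion (classification cross-entropy plus point-sampled binary cross-entropy and dice on the masks, with down-weighted "no object" targets). A simplified sketch of how such weights combine the per-batch loss terms (illustrative only; the real criterion also uses these weights inside the Hungarian matching cost and for deep supervision):

import torch

def combine_losses(loss_ce, loss_mask, loss_dice,
                   class_weight=2.0, mask_weight=5.0, dice_weight=5.0):
    # Weighted sum of the three criterion terms, mirroring the *_WEIGHT keys above.
    return class_weight * loss_ce + mask_weight * loss_mask + dice_weight * loss_dice

total = combine_losses(torch.tensor(0.7), torch.tensor(0.3), torch.tensor(0.4))
print(total)  # tensor(4.9000)
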
mask2former/configs/ade20k/instance-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_160k.yaml
ADDED
@@ -0,0 +1,18 @@
_BASE_: ../maskformer2_R50_bs16_160k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 192
    DEPTHS: [2, 2, 18, 2]
    NUM_HEADS: [6, 12, 24, 48]
    WINDOW_SIZE: 12
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
    PRETRAIN_IMG_SIZE: 384
  WEIGHTS: "swin_large_patch4_window12_384_22k.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  MASK_FORMER:
    NUM_OBJECT_QUERIES: 200
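
The SWIN block above describes a Swin-L backbone: EMBED_DIM is the channel width of the first stage and doubles at each of the four stages, so the feature maps exposed as res2-res5 grow accordingly (standard Swin behaviour, shown here only as a quick check):

embed_dim = 192                 # EMBED_DIM for Swin-L above
depths = [2, 2, 18, 2]          # DEPTHS: number of blocks per stage
widths = [embed_dim * 2 ** i for i in range(len(depths))]
print(widths)                   # [192, 384, 768, 1536] -> channels of res2, res3, res4, res5
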
mask2former/configs/ade20k/panoptic-segmentation/Base-ADE20K-PanopticSegmentation.yaml
ADDED
@@ -0,0 +1,61 @@
MODEL:
  BACKBONE:
    FREEZE_AT: 0
    NAME: "build_resnet_backbone"
  WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  RESNETS:
    DEPTH: 50
    STEM_TYPE: "basic" # not used
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
    # NORM: "SyncBN"
    RES5_MULTI_GRID: [1, 1, 1] # not used
DATASETS:
  TRAIN: ("ade20k_panoptic_train",)
  TEST: ("ade20k_panoptic_val",)
SOLVER:
  IMS_PER_BATCH: 16
  BASE_LR: 0.0001
  MAX_ITER: 160000
  WARMUP_FACTOR: 1.0
  WARMUP_ITERS: 0
  WEIGHT_DECAY: 0.05
  OPTIMIZER: "ADAMW"
  LR_SCHEDULER_NAME: "WarmupPolyLR"
  BACKBONE_MULTIPLIER: 0.1
  CLIP_GRADIENTS:
    ENABLED: True
    CLIP_TYPE: "full_model"
    CLIP_VALUE: 0.01
    NORM_TYPE: 2.0
  AMP:
    ENABLED: True
INPUT:
  MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 640) for x in range(5, 21)]"]
  MIN_SIZE_TRAIN_SAMPLING: "choice"
  MIN_SIZE_TEST: 640
  MAX_SIZE_TRAIN: 2560
  MAX_SIZE_TEST: 2560
  CROP:
    ENABLED: True
    TYPE: "absolute"
    SIZE: (640, 640)
    SINGLE_CATEGORY_MAX_AREA: 1.0
  COLOR_AUG_SSD: True
  SIZE_DIVISIBILITY: 640 # used in dataset mapper
  FORMAT: "RGB"
  DATASET_MAPPER_NAME: "mask_former_panoptic"
TEST:
  EVAL_PERIOD: 5000
  AUG:
    ENABLED: False
    MIN_SIZES: [320, 480, 640, 800, 960, 1120]
    MAX_SIZE: 4480
    FLIP: True
DATALOADER:
  FILTER_EMPTY_ANNOTATIONS: True
  NUM_WORKERS: 4
VERSION: 2

mask2former/configs/ade20k/panoptic-segmentation/maskformer2_R50_bs16_160k.yaml
ADDED
@@ -0,0 +1,44 @@
_BASE_: Base-ADE20K-PanopticSegmentation.yaml
MODEL:
  META_ARCHITECTURE: "MaskFormer"
  SEM_SEG_HEAD:
    NAME: "MaskFormerHead"
    IGNORE_VALUE: 255
    NUM_CLASSES: 150
    LOSS_WEIGHT: 1.0
    CONVS_DIM: 256
    MASK_DIM: 256
    NORM: "GN"
    # pixel decoder
    PIXEL_DECODER_NAME: "MSDeformAttnPixelDecoder"
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
    COMMON_STRIDE: 4
    TRANSFORMER_ENC_LAYERS: 6
  MASK_FORMER:
    TRANSFORMER_DECODER_NAME: "MultiScaleMaskedTransformerDecoder"
    TRANSFORMER_IN_FEATURE: "multi_scale_pixel_decoder"
    DEEP_SUPERVISION: True
    NO_OBJECT_WEIGHT: 0.1
    CLASS_WEIGHT: 2.0
    MASK_WEIGHT: 5.0
    DICE_WEIGHT: 5.0
    HIDDEN_DIM: 256
    NUM_OBJECT_QUERIES: 100
    NHEADS: 8
    DROPOUT: 0.0
    DIM_FEEDFORWARD: 2048
    ENC_LAYERS: 0
    PRE_NORM: False
    ENFORCE_INPUT_PROJ: False
    SIZE_DIVISIBILITY: 32
    DEC_LAYERS: 10 # 9 decoder layers, add one for the loss on learnable query
    TRAIN_NUM_POINTS: 12544
    OVERSAMPLE_RATIO: 3.0
    IMPORTANCE_SAMPLE_RATIO: 0.75
    TEST:
      SEMANTIC_ON: True
      INSTANCE_ON: True
      PANOPTIC_ON: True
      OVERLAP_THRESHOLD: 0.8
      OBJECT_MASK_THRESHOLD: 0.8
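
TRAIN_NUM_POINTS, OVERSAMPLE_RATIO and IMPORTANCE_SAMPLE_RATIO define a PointRend-style sampling budget for the mask losses: 12544 points (112 x 112) are supervised per mask, drawn from an oversampled candidate pool with most points taken where the prediction is most uncertain. A sketch of the budget arithmetic (not the sampling code itself):

num_points = 112 * 112                                     # TRAIN_NUM_POINTS = 12544
oversample_ratio = 3.0                                     # OVERSAMPLE_RATIO
importance_sample_ratio = 0.75                             # IMPORTANCE_SAMPLE_RATIO

num_candidates = int(num_points * oversample_ratio)        # 37632 candidate points per mask
num_uncertain = int(num_points * importance_sample_ratio)  # 9408 kept from the most uncertain
num_random = num_points - num_uncertain                    # 3136 drawn uniformly at random
print(num_candidates, num_uncertain, num_random)
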
mask2former/configs/ade20k/panoptic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_160k.yaml
ADDED
@@ -0,0 +1,18 @@
_BASE_: ../maskformer2_R50_bs16_160k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 192
    DEPTHS: [2, 2, 18, 2]
    NUM_HEADS: [6, 12, 24, 48]
    WINDOW_SIZE: 12
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
    PRETRAIN_IMG_SIZE: 384
  WEIGHTS: "swin_large_patch4_window12_384_22k.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  MASK_FORMER:
    NUM_OBJECT_QUERIES: 200

mask2former/configs/ade20k/semantic-segmentation/Base-ADE20K-SemanticSegmentation.yaml
ADDED
@@ -0,0 +1,61 @@
MODEL:
  BACKBONE:
    FREEZE_AT: 0
    NAME: "build_resnet_backbone"
  WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  RESNETS:
    DEPTH: 50
    STEM_TYPE: "basic" # not used
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
    # NORM: "SyncBN"
    RES5_MULTI_GRID: [1, 1, 1] # not used
DATASETS:
  TRAIN: ("ade20k_sem_seg_train",)
  TEST: ("ade20k_sem_seg_val",)
SOLVER:
  IMS_PER_BATCH: 16
  BASE_LR: 0.0001
  MAX_ITER: 160000
  WARMUP_FACTOR: 1.0
  WARMUP_ITERS: 0
  WEIGHT_DECAY: 0.05
  OPTIMIZER: "ADAMW"
  LR_SCHEDULER_NAME: "WarmupPolyLR"
  BACKBONE_MULTIPLIER: 0.1
  CLIP_GRADIENTS:
    ENABLED: True
    CLIP_TYPE: "full_model"
    CLIP_VALUE: 0.01
    NORM_TYPE: 2.0
  AMP:
    ENABLED: True
INPUT:
  MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 512) for x in range(5, 21)]"]
  MIN_SIZE_TRAIN_SAMPLING: "choice"
  MIN_SIZE_TEST: 512
  MAX_SIZE_TRAIN: 2048
  MAX_SIZE_TEST: 2048
  CROP:
    ENABLED: True
    TYPE: "absolute"
    SIZE: (512, 512)
    SINGLE_CATEGORY_MAX_AREA: 1.0
  COLOR_AUG_SSD: True
  SIZE_DIVISIBILITY: 512 # used in dataset mapper
  FORMAT: "RGB"
  DATASET_MAPPER_NAME: "mask_former_semantic"
TEST:
  EVAL_PERIOD: 5000
  AUG:
    ENABLED: False
    MIN_SIZES: [256, 384, 512, 640, 768, 896]
    MAX_SIZE: 3584
    FLIP: True
DATALOADER:
  FILTER_EMPTY_ANNOTATIONS: True
  NUM_WORKERS: 4
VERSION: 2
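
LR_SCHEDULER_NAME: "WarmupPolyLR" decays the learning rate polynomially from BASE_LR to zero over MAX_ITER steps, with effectively no warmup here (WARMUP_ITERS: 0). A sketch of the schedule's shape, assuming the commonly used exponent of 0.9 (the actual exponent is a separate config key not shown in this hunk):

base_lr, max_iter, power = 0.0001, 160000, 0.9  # power is an assumption, see above

def poly_lr(step: int) -> float:
    # Polynomial ("poly") decay: full BASE_LR at step 0, zero at MAX_ITER.
    return base_lr * (1 - step / max_iter) ** power

for step in (0, 40000, 80000, 120000, 159999):
    print(step, f"{poly_lr(step):.2e}")
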
mask2former/configs/ade20k/semantic-segmentation/maskformer2_R101_bs16_90k.yaml
ADDED
@@ -0,0 +1,11 @@
_BASE_: maskformer2_R50_bs16_160k.yaml
MODEL:
  WEIGHTS: "R-101.pkl"
  RESNETS:
    DEPTH: 101
    STEM_TYPE: "basic" # not used
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
    NORM: "SyncBN"
    RES5_MULTI_GRID: [1, 1, 1] # not used

mask2former/configs/ade20k/semantic-segmentation/maskformer2_R50_bs16_160k.yaml
ADDED
@@ -0,0 +1,44 @@
_BASE_: Base-ADE20K-SemanticSegmentation.yaml
MODEL:
  META_ARCHITECTURE: "MaskFormer"
  SEM_SEG_HEAD:
    NAME: "MaskFormerHead"
    IGNORE_VALUE: 255
    NUM_CLASSES: 150
    LOSS_WEIGHT: 1.0
    CONVS_DIM: 256
    MASK_DIM: 256
    NORM: "GN"
    # pixel decoder
    PIXEL_DECODER_NAME: "MSDeformAttnPixelDecoder"
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
    COMMON_STRIDE: 4
    TRANSFORMER_ENC_LAYERS: 6
  MASK_FORMER:
    TRANSFORMER_DECODER_NAME: "MultiScaleMaskedTransformerDecoder"
    TRANSFORMER_IN_FEATURE: "multi_scale_pixel_decoder"
    DEEP_SUPERVISION: True
    NO_OBJECT_WEIGHT: 0.1
    CLASS_WEIGHT: 2.0
    MASK_WEIGHT: 5.0
    DICE_WEIGHT: 5.0
    HIDDEN_DIM: 256
    NUM_OBJECT_QUERIES: 100
    NHEADS: 8
    DROPOUT: 0.0
    DIM_FEEDFORWARD: 2048
    ENC_LAYERS: 0
    PRE_NORM: False
    ENFORCE_INPUT_PROJ: False
    SIZE_DIVISIBILITY: 32
    DEC_LAYERS: 10 # 9 decoder layers, add one for the loss on learnable query
    TRAIN_NUM_POINTS: 12544
    OVERSAMPLE_RATIO: 3.0
    IMPORTANCE_SAMPLE_RATIO: 0.75
    TEST:
      SEMANTIC_ON: True
      INSTANCE_ON: False
      PANOPTIC_ON: False
      OVERLAP_THRESHOLD: 0.8
      OBJECT_MASK_THRESHOLD: 0.8
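
With SEMANTIC_ON: True and the other test modes off, inference collapses the query predictions into one semantic map by combining per-query class scores with per-query mask predictions. The MaskFormer-style aggregation is essentially a single einsum; a sketch with dummy shapes (the real post-processing also resizes to the input resolution):

import torch

num_queries, num_classes, h, w = 100, 150, 128, 128
class_logits = torch.randn(num_queries, num_classes + 1)  # +1 for the "no object" class
mask_logits = torch.randn(num_queries, h, w)

class_probs = class_logits.softmax(-1)[:, :-1]            # drop the "no object" column
mask_probs = mask_logits.sigmoid()
sem_seg = torch.einsum("qc,qhw->chw", class_probs, mask_probs)
print(sem_seg.shape)                                      # torch.Size([150, 128, 128])
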
mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_base_384_bs16_160k_res640.yaml
ADDED
@@ -0,0 +1,37 @@
_BASE_: ../maskformer2_R50_bs16_160k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 128
    DEPTHS: [2, 2, 18, 2]
    NUM_HEADS: [4, 8, 16, 32]
    WINDOW_SIZE: 12
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
    PRETRAIN_IMG_SIZE: 384
  WEIGHTS: "swin_base_patch4_window12_384.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
INPUT:
  MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 640) for x in range(5, 21)]"]
  MIN_SIZE_TRAIN_SAMPLING: "choice"
  MIN_SIZE_TEST: 640
  MAX_SIZE_TRAIN: 2560
  MAX_SIZE_TEST: 2560
  CROP:
    ENABLED: True
    TYPE: "absolute"
    SIZE: (640, 640)
    SINGLE_CATEGORY_MAX_AREA: 1.0
  COLOR_AUG_SSD: True
  SIZE_DIVISIBILITY: 640 # used in dataset mapper
  FORMAT: "RGB"
TEST:
  EVAL_PERIOD: 5000
  AUG:
    ENABLED: False
    MIN_SIZES: [320, 480, 640, 800, 960, 1120]
    MAX_SIZE: 4480
    FLIP: True

mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_base_IN21k_384_bs16_160k_res640.yaml
ADDED
@@ -0,0 +1,37 @@
_BASE_: ../maskformer2_R50_bs16_160k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 128
    DEPTHS: [2, 2, 18, 2]
    NUM_HEADS: [4, 8, 16, 32]
    WINDOW_SIZE: 12
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
    PRETRAIN_IMG_SIZE: 384
  WEIGHTS: "swin_base_patch4_window12_384_22k.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
INPUT:
  MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 640) for x in range(5, 21)]"]
  MIN_SIZE_TRAIN_SAMPLING: "choice"
  MIN_SIZE_TEST: 640
  MAX_SIZE_TRAIN: 2560
  MAX_SIZE_TEST: 2560
  CROP:
    ENABLED: True
    TYPE: "absolute"
    SIZE: (640, 640)
    SINGLE_CATEGORY_MAX_AREA: 1.0
  COLOR_AUG_SSD: True
  SIZE_DIVISIBILITY: 640 # used in dataset mapper
  FORMAT: "RGB"
TEST:
  EVAL_PERIOD: 5000
  AUG:
    ENABLED: False
    MIN_SIZES: [320, 480, 640, 800, 960, 1120]
    MAX_SIZE: 4480
    FLIP: True

mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_160k_res640.yaml
ADDED
@@ -0,0 +1,37 @@
_BASE_: ../maskformer2_R50_bs16_160k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 192
    DEPTHS: [2, 2, 18, 2]
    NUM_HEADS: [6, 12, 24, 48]
    WINDOW_SIZE: 12
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
    PRETRAIN_IMG_SIZE: 384
  WEIGHTS: "swin_large_patch4_window12_384_22k.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
INPUT:
  MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 640) for x in range(5, 21)]"]
  MIN_SIZE_TRAIN_SAMPLING: "choice"
  MIN_SIZE_TEST: 640
  MAX_SIZE_TRAIN: 2560
  MAX_SIZE_TEST: 2560
  CROP:
    ENABLED: True
    TYPE: "absolute"
    SIZE: (640, 640)
    SINGLE_CATEGORY_MAX_AREA: 1.0
  COLOR_AUG_SSD: True
  SIZE_DIVISIBILITY: 640 # used in dataset mapper
  FORMAT: "RGB"
TEST:
  EVAL_PERIOD: 5000
  AUG:
    ENABLED: False
    MIN_SIZES: [320, 480, 640, 800, 960, 1120]
    MAX_SIZE: 4480
    FLIP: True

mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_small_bs16_160k.yaml
ADDED
@@ -0,0 +1,15 @@
_BASE_: ../maskformer2_R50_bs16_160k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 96
    DEPTHS: [2, 2, 18, 2]
    NUM_HEADS: [3, 6, 12, 24]
    WINDOW_SIZE: 7
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
  WEIGHTS: "swin_small_patch4_window7_224.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]

mask2former/configs/ade20k/semantic-segmentation/swin/maskformer2_swin_tiny_bs16_160k.yaml
ADDED
@@ -0,0 +1,15 @@
_BASE_: ../maskformer2_R50_bs16_160k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 96
    DEPTHS: [2, 2, 6, 2]
    NUM_HEADS: [3, 6, 12, 24]
    WINDOW_SIZE: 7
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
  WEIGHTS: "swin_tiny_patch4_window7_224.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]

mask2former/configs/cityscapes/instance-segmentation/Base-Cityscapes-InstanceSegmentation.yaml
ADDED
@@ -0,0 +1,61 @@
MODEL:
  BACKBONE:
    FREEZE_AT: 0
    NAME: "build_resnet_backbone"
  WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  RESNETS:
    DEPTH: 50
    STEM_TYPE: "basic" # not used
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
    NORM: "SyncBN" # use syncbn for cityscapes dataset
    RES5_MULTI_GRID: [1, 1, 1] # not used
DATASETS:
  TRAIN: ("cityscapes_fine_instance_seg_train",)
  TEST: ("cityscapes_fine_instance_seg_val",)
SOLVER:
  IMS_PER_BATCH: 16
  BASE_LR: 0.0001
  MAX_ITER: 90000
  WARMUP_FACTOR: 1.0
  WARMUP_ITERS: 0
  WEIGHT_DECAY: 0.05
  OPTIMIZER: "ADAMW"
  LR_SCHEDULER_NAME: "WarmupPolyLR"
  BACKBONE_MULTIPLIER: 0.1
  CLIP_GRADIENTS:
    ENABLED: True
    CLIP_TYPE: "full_model"
    CLIP_VALUE: 0.01
    NORM_TYPE: 2.0
  AMP:
    ENABLED: True
INPUT:
  MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 1024) for x in range(5, 21)]"]
  MIN_SIZE_TRAIN_SAMPLING: "choice"
  MIN_SIZE_TEST: 1024
  MAX_SIZE_TRAIN: 4096
  MAX_SIZE_TEST: 2048
  CROP:
    ENABLED: True
    TYPE: "absolute"
    SIZE: (512, 1024)
    SINGLE_CATEGORY_MAX_AREA: 1.0
  COLOR_AUG_SSD: True
  SIZE_DIVISIBILITY: -1
  FORMAT: "RGB"
  DATASET_MAPPER_NAME: "mask_former_instance"
TEST:
  EVAL_PERIOD: 5000
  AUG:
    ENABLED: False
    MIN_SIZES: [512, 768, 1024, 1280, 1536, 1792]
    MAX_SIZE: 4096
    FLIP: True
DATALOADER:
  FILTER_EMPTY_ANNOTATIONS: True
  NUM_WORKERS: 4
VERSION: 2
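
CLIP_GRADIENTS with CLIP_TYPE: "full_model" clips one global L2 norm computed over all parameters, capped at CLIP_VALUE, instead of clipping each parameter separately. In plain PyTorch terms the equivalent operation looks roughly like this (a generic sketch with a stand-in model):

import torch
import torch.nn as nn

model = nn.Linear(16, 4)                          # stand-in for the detector
loss = model(torch.randn(8, 16)).pow(2).mean()    # stand-in loss
loss.backward()

# One global gradient norm over every parameter, capped at CLIP_VALUE = 0.01 (L2, NORM_TYPE 2.0).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.01, norm_type=2.0)
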
mask2former/configs/cityscapes/instance-segmentation/maskformer2_R101_bs16_90k.yaml
ADDED
@@ -0,0 +1,11 @@
_BASE_: maskformer2_R50_bs16_90k.yaml
MODEL:
  WEIGHTS: "R-101.pkl"
  RESNETS:
    DEPTH: 101
    STEM_TYPE: "basic" # not used
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
    NORM: "SyncBN"
    RES5_MULTI_GRID: [1, 1, 1] # not used

mask2former/configs/cityscapes/instance-segmentation/maskformer2_R50_bs16_90k.yaml
ADDED
@@ -0,0 +1,44 @@
_BASE_: Base-Cityscapes-InstanceSegmentation.yaml
MODEL:
  META_ARCHITECTURE: "MaskFormer"
  SEM_SEG_HEAD:
    NAME: "MaskFormerHead"
    IGNORE_VALUE: 255
    NUM_CLASSES: 8
    LOSS_WEIGHT: 1.0
    CONVS_DIM: 256
    MASK_DIM: 256
    NORM: "GN"
    # pixel decoder
    PIXEL_DECODER_NAME: "MSDeformAttnPixelDecoder"
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
    COMMON_STRIDE: 4
    TRANSFORMER_ENC_LAYERS: 6
  MASK_FORMER:
    TRANSFORMER_DECODER_NAME: "MultiScaleMaskedTransformerDecoder"
    TRANSFORMER_IN_FEATURE: "multi_scale_pixel_decoder"
    DEEP_SUPERVISION: True
    NO_OBJECT_WEIGHT: 0.1
    CLASS_WEIGHT: 2.0
    MASK_WEIGHT: 5.0
    DICE_WEIGHT: 5.0
    HIDDEN_DIM: 256
    NUM_OBJECT_QUERIES: 100
    NHEADS: 8
    DROPOUT: 0.0
    DIM_FEEDFORWARD: 2048
    ENC_LAYERS: 0
    PRE_NORM: False
    ENFORCE_INPUT_PROJ: False
    SIZE_DIVISIBILITY: 32
    DEC_LAYERS: 10 # 9 decoder layers, add one for the loss on learnable query
    TRAIN_NUM_POINTS: 12544
    OVERSAMPLE_RATIO: 3.0
    IMPORTANCE_SAMPLE_RATIO: 0.75
    TEST:
      SEMANTIC_ON: False
      INSTANCE_ON: True
      PANOPTIC_ON: False
      OVERLAP_THRESHOLD: 0.8
      OBJECT_MASK_THRESHOLD: 0.8
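
NUM_CLASSES: 8 corresponds to the eight Cityscapes "thing" categories used for instance segmentation; the remaining Cityscapes classes are stuff and only appear in the 19-class semantic and panoptic configs. For reference:

# The 8 Cityscapes instance ("thing") categories behind NUM_CLASSES: 8 above.
CITYSCAPES_THING_CLASSES = [
    "person", "rider", "car", "truck", "bus", "train", "motorcycle", "bicycle",
]
assert len(CITYSCAPES_THING_CLASSES) == 8
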
mask2former/configs/cityscapes/instance-segmentation/swin/maskformer2_swin_base_IN21k_384_bs16_90k.yaml
ADDED
@@ -0,0 +1,16 @@
_BASE_: ../maskformer2_R50_bs16_90k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 128
    DEPTHS: [2, 2, 18, 2]
    NUM_HEADS: [4, 8, 16, 32]
    WINDOW_SIZE: 12
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
    PRETRAIN_IMG_SIZE: 384
  WEIGHTS: "swin_base_patch4_window12_384_22k.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]

mask2former/configs/cityscapes/instance-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_90k.yaml
ADDED
@@ -0,0 +1,18 @@
_BASE_: ../maskformer2_R50_bs16_90k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 192
    DEPTHS: [2, 2, 18, 2]
    NUM_HEADS: [6, 12, 24, 48]
    WINDOW_SIZE: 12
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
    PRETRAIN_IMG_SIZE: 384
  WEIGHTS: "swin_large_patch4_window12_384_22k.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  MASK_FORMER:
    NUM_OBJECT_QUERIES: 200

mask2former/configs/cityscapes/instance-segmentation/swin/maskformer2_swin_small_bs16_90k.yaml
ADDED
@@ -0,0 +1,15 @@
_BASE_: ../maskformer2_R50_bs16_90k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 96
    DEPTHS: [2, 2, 18, 2]
    NUM_HEADS: [3, 6, 12, 24]
    WINDOW_SIZE: 7
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
  WEIGHTS: "swin_small_patch4_window7_224.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]

mask2former/configs/cityscapes/instance-segmentation/swin/maskformer2_swin_tiny_bs16_90k.yaml
ADDED
@@ -0,0 +1,15 @@
_BASE_: ../maskformer2_R50_bs16_90k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 96
    DEPTHS: [2, 2, 6, 2]
    NUM_HEADS: [3, 6, 12, 24]
    WINDOW_SIZE: 7
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
  WEIGHTS: "swin_tiny_patch4_window7_224.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]

mask2former/configs/cityscapes/panoptic-segmentation/Base-Cityscapes-PanopticSegmentation.yaml
ADDED
@@ -0,0 +1,61 @@
MODEL:
  BACKBONE:
    FREEZE_AT: 0
    NAME: "build_resnet_backbone"
  WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  RESNETS:
    DEPTH: 50
    STEM_TYPE: "basic" # not used
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
    NORM: "SyncBN" # use syncbn for cityscapes dataset
    RES5_MULTI_GRID: [1, 1, 1] # not used
DATASETS:
  TRAIN: ("cityscapes_fine_panoptic_train",)
  TEST: ("cityscapes_fine_panoptic_val",)
SOLVER:
  IMS_PER_BATCH: 16
  BASE_LR: 0.0001
  MAX_ITER: 90000
  WARMUP_FACTOR: 1.0
  WARMUP_ITERS: 0
  WEIGHT_DECAY: 0.05
  OPTIMIZER: "ADAMW"
  LR_SCHEDULER_NAME: "WarmupPolyLR"
  BACKBONE_MULTIPLIER: 0.1
  CLIP_GRADIENTS:
    ENABLED: True
    CLIP_TYPE: "full_model"
    CLIP_VALUE: 0.01
    NORM_TYPE: 2.0
  AMP:
    ENABLED: True
INPUT:
  MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 1024) for x in range(5, 21)]"]
  MIN_SIZE_TRAIN_SAMPLING: "choice"
  MIN_SIZE_TEST: 1024
  MAX_SIZE_TRAIN: 4096
  MAX_SIZE_TEST: 2048
  CROP:
    ENABLED: True
    TYPE: "absolute"
    SIZE: (512, 1024)
    SINGLE_CATEGORY_MAX_AREA: 1.0
  COLOR_AUG_SSD: True
  SIZE_DIVISIBILITY: -1
  FORMAT: "RGB"
  DATASET_MAPPER_NAME: "mask_former_panoptic"
TEST:
  EVAL_PERIOD: 5000
  AUG:
    ENABLED: False
    MIN_SIZES: [512, 768, 1024, 1280, 1536, 1792]
    MAX_SIZE: 4096
    FLIP: True
DATALOADER:
  FILTER_EMPTY_ANNOTATIONS: True
  NUM_WORKERS: 4
VERSION: 2
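
SOLVER.AMP.ENABLED: True turns on mixed-precision training. Outside of detectron2's trainer the same idea follows the standard torch.cuda.amp pattern; a generic sketch (not the trainer this repo actually uses, and it assumes a CUDA device):

import torch

model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 16, device="cuda")
with torch.cuda.amp.autocast():        # forward in mixed precision
    loss = model(x).pow(2).mean()
scaler.scale(loss).backward()          # scaled backward to avoid fp16 underflow
scaler.step(optimizer)                 # unscales gradients, then steps
scaler.update()
optimizer.zero_grad()
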
mask2former/configs/cityscapes/panoptic-segmentation/maskformer2_R101_bs16_90k.yaml
ADDED
@@ -0,0 +1,11 @@
_BASE_: maskformer2_R50_bs16_90k.yaml
MODEL:
  WEIGHTS: "R-101.pkl"
  RESNETS:
    DEPTH: 101
    STEM_TYPE: "basic" # not used
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
    NORM: "SyncBN"
    RES5_MULTI_GRID: [1, 1, 1] # not used

mask2former/configs/cityscapes/panoptic-segmentation/maskformer2_R50_bs16_90k.yaml
ADDED
@@ -0,0 +1,44 @@
_BASE_: Base-Cityscapes-PanopticSegmentation.yaml
MODEL:
  META_ARCHITECTURE: "MaskFormer"
  SEM_SEG_HEAD:
    NAME: "MaskFormerHead"
    IGNORE_VALUE: 255
    NUM_CLASSES: 19
    LOSS_WEIGHT: 1.0
    CONVS_DIM: 256
    MASK_DIM: 256
    NORM: "GN"
    # pixel decoder
    PIXEL_DECODER_NAME: "MSDeformAttnPixelDecoder"
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
    COMMON_STRIDE: 4
    TRANSFORMER_ENC_LAYERS: 6
  MASK_FORMER:
    TRANSFORMER_DECODER_NAME: "MultiScaleMaskedTransformerDecoder"
    TRANSFORMER_IN_FEATURE: "multi_scale_pixel_decoder"
    DEEP_SUPERVISION: True
    NO_OBJECT_WEIGHT: 0.1
    CLASS_WEIGHT: 2.0
    MASK_WEIGHT: 5.0
    DICE_WEIGHT: 5.0
    HIDDEN_DIM: 256
    NUM_OBJECT_QUERIES: 100
    NHEADS: 8
    DROPOUT: 0.0
    DIM_FEEDFORWARD: 2048
    ENC_LAYERS: 0
    PRE_NORM: False
    ENFORCE_INPUT_PROJ: False
    SIZE_DIVISIBILITY: 32
    DEC_LAYERS: 10 # 9 decoder layers, add one for the loss on learnable query
    TRAIN_NUM_POINTS: 12544
    OVERSAMPLE_RATIO: 3.0
    IMPORTANCE_SAMPLE_RATIO: 0.75
    TEST:
      SEMANTIC_ON: True
      INSTANCE_ON: True
      PANOPTIC_ON: True
      OVERLAP_THRESHOLD: 0.8
      OBJECT_MASK_THRESHOLD: 0.8
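
For panoptic output, OBJECT_MASK_THRESHOLD and OVERLAP_THRESHOLD steer the post-processing: a query is kept only if its best non-void class probability exceeds 0.8, and after masks are merged a segment is discarded if too little of its original mask survives the overlap resolution. A simplified sketch of those two tests (the full logic lives in the model's panoptic inference):

import torch

def keep_query(class_logits_q, object_mask_threshold=0.8):
    # Keep a query whose top non-void class score passes the threshold.
    probs = class_logits_q.softmax(-1)
    score, _ = probs[:-1].max(-1)          # last entry is the "no object" class
    return bool(score > object_mask_threshold)

def keep_segment(original_mask, visible_mask, overlap_threshold=0.8):
    # Drop segments that lost most of their area to higher-scoring masks.
    original_area = original_mask.sum().item()
    visible_area = visible_mask.sum().item()
    return original_area > 0 and visible_area / original_area > overlap_threshold
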
mask2former/configs/cityscapes/panoptic-segmentation/swin/maskformer2_swin_base_IN21k_384_bs16_90k.yaml
ADDED
@@ -0,0 +1,16 @@
_BASE_: ../maskformer2_R50_bs16_90k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 128
    DEPTHS: [2, 2, 18, 2]
    NUM_HEADS: [4, 8, 16, 32]
    WINDOW_SIZE: 12
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
    PRETRAIN_IMG_SIZE: 384
  WEIGHTS: "swin_base_patch4_window12_384_22k.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]

mask2former/configs/cityscapes/panoptic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_90k.yaml
ADDED
@@ -0,0 +1,18 @@
_BASE_: ../maskformer2_R50_bs16_90k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 192
    DEPTHS: [2, 2, 18, 2]
    NUM_HEADS: [6, 12, 24, 48]
    WINDOW_SIZE: 12
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
    PRETRAIN_IMG_SIZE: 384
  WEIGHTS: "swin_large_patch4_window12_384_22k.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  MASK_FORMER:
    NUM_OBJECT_QUERIES: 200

mask2former/configs/cityscapes/panoptic-segmentation/swin/maskformer2_swin_small_bs16_90k.yaml
ADDED
@@ -0,0 +1,15 @@
_BASE_: ../maskformer2_R50_bs16_90k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 96
    DEPTHS: [2, 2, 18, 2]
    NUM_HEADS: [3, 6, 12, 24]
    WINDOW_SIZE: 7
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
  WEIGHTS: "swin_small_patch4_window7_224.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]

mask2former/configs/cityscapes/panoptic-segmentation/swin/maskformer2_swin_tiny_bs16_90k.yaml
ADDED
@@ -0,0 +1,15 @@
_BASE_: ../maskformer2_R50_bs16_90k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 96
    DEPTHS: [2, 2, 6, 2]
    NUM_HEADS: [3, 6, 12, 24]
    WINDOW_SIZE: 7
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
  WEIGHTS: "swin_tiny_patch4_window7_224.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]

mask2former/configs/cityscapes/semantic-segmentation/Base-Cityscapes-SemanticSegmentation.yaml
ADDED
@@ -0,0 +1,61 @@
MODEL:
  BACKBONE:
    FREEZE_AT: 0
    NAME: "build_resnet_backbone"
  WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  RESNETS:
    DEPTH: 50
    STEM_TYPE: "basic" # not used
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
    NORM: "SyncBN" # use syncbn for cityscapes dataset
    RES5_MULTI_GRID: [1, 1, 1] # not used
DATASETS:
  TRAIN: ("cityscapes_fine_sem_seg_train",)
  TEST: ("cityscapes_fine_sem_seg_val",)
SOLVER:
  IMS_PER_BATCH: 16
  BASE_LR: 0.0001
  MAX_ITER: 90000
  WARMUP_FACTOR: 1.0
  WARMUP_ITERS: 0
  WEIGHT_DECAY: 0.05
  OPTIMIZER: "ADAMW"
  LR_SCHEDULER_NAME: "WarmupPolyLR"
  BACKBONE_MULTIPLIER: 0.1
  CLIP_GRADIENTS:
    ENABLED: True
    CLIP_TYPE: "full_model"
    CLIP_VALUE: 0.01
    NORM_TYPE: 2.0
  AMP:
    ENABLED: True
INPUT:
  MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 1024) for x in range(5, 21)]"]
  MIN_SIZE_TRAIN_SAMPLING: "choice"
  MIN_SIZE_TEST: 1024
  MAX_SIZE_TRAIN: 4096
  MAX_SIZE_TEST: 2048
  CROP:
    ENABLED: True
    TYPE: "absolute"
    SIZE: (512, 1024)
    SINGLE_CATEGORY_MAX_AREA: 1.0
  COLOR_AUG_SSD: True
  SIZE_DIVISIBILITY: -1
  FORMAT: "RGB"
  DATASET_MAPPER_NAME: "mask_former_semantic"
TEST:
  EVAL_PERIOD: 5000
  AUG:
    ENABLED: False
    MIN_SIZES: [512, 768, 1024, 1280, 1536, 1792]
    MAX_SIZE: 4096
    FLIP: True
DATALOADER:
  FILTER_EMPTY_ANNOTATIONS: True
  NUM_WORKERS: 4
VERSION: 2
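
BACKBONE_MULTIPLIER: 0.1 gives backbone parameters a tenth of BASE_LR while the rest of the network trains at the full rate, which is implemented with per-group learning rates in the optimizer. Roughly (a sketch; the real builder also special-cases weight decay for norms and embeddings):

import torch

def build_optimizer(model, base_lr=0.0001, backbone_multiplier=0.1, weight_decay=0.05):
    backbone_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (backbone_params if "backbone" in name else other_params).append(param)
    return torch.optim.AdamW(
        [
            {"params": backbone_params, "lr": base_lr * backbone_multiplier},
            {"params": other_params, "lr": base_lr},
        ],
        lr=base_lr,
        weight_decay=weight_decay,
    )
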
mask2former/configs/cityscapes/semantic-segmentation/maskformer2_R101_bs16_90k.yaml
ADDED
@@ -0,0 +1,11 @@
_BASE_: maskformer2_R50_bs16_90k.yaml
MODEL:
  WEIGHTS: "R-101.pkl"
  RESNETS:
    DEPTH: 101
    STEM_TYPE: "basic" # not used
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
    NORM: "SyncBN"
    RES5_MULTI_GRID: [1, 1, 1] # not used

mask2former/configs/cityscapes/semantic-segmentation/maskformer2_R50_bs16_90k.yaml
ADDED
@@ -0,0 +1,44 @@
_BASE_: Base-Cityscapes-SemanticSegmentation.yaml
MODEL:
  META_ARCHITECTURE: "MaskFormer"
  SEM_SEG_HEAD:
    NAME: "MaskFormerHead"
    IGNORE_VALUE: 255
    NUM_CLASSES: 19
    LOSS_WEIGHT: 1.0
    CONVS_DIM: 256
    MASK_DIM: 256
    NORM: "GN"
    # pixel decoder
    PIXEL_DECODER_NAME: "MSDeformAttnPixelDecoder"
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
    COMMON_STRIDE: 4
    TRANSFORMER_ENC_LAYERS: 6
  MASK_FORMER:
    TRANSFORMER_DECODER_NAME: "MultiScaleMaskedTransformerDecoder"
    TRANSFORMER_IN_FEATURE: "multi_scale_pixel_decoder"
    DEEP_SUPERVISION: True
    NO_OBJECT_WEIGHT: 0.1
    CLASS_WEIGHT: 2.0
    MASK_WEIGHT: 5.0
    DICE_WEIGHT: 5.0
    HIDDEN_DIM: 256
    NUM_OBJECT_QUERIES: 100
    NHEADS: 8
    DROPOUT: 0.0
    DIM_FEEDFORWARD: 2048
    ENC_LAYERS: 0
    PRE_NORM: False
    ENFORCE_INPUT_PROJ: False
    SIZE_DIVISIBILITY: 32
    DEC_LAYERS: 10 # 9 decoder layers, add one for the loss on learnable query
    TRAIN_NUM_POINTS: 12544
    OVERSAMPLE_RATIO: 3.0
    IMPORTANCE_SAMPLE_RATIO: 0.75
    TEST:
      SEMANTIC_ON: True
      INSTANCE_ON: False
      PANOPTIC_ON: False
      OVERLAP_THRESHOLD: 0.8
      OBJECT_MASK_THRESHOLD: 0.8

mask2former/configs/cityscapes/semantic-segmentation/swin/maskformer2_swin_base_IN21k_384_bs16_90k.yaml
ADDED
@@ -0,0 +1,16 @@
_BASE_: ../maskformer2_R50_bs16_90k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 128
    DEPTHS: [2, 2, 18, 2]
    NUM_HEADS: [4, 8, 16, 32]
    WINDOW_SIZE: 12
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
    PRETRAIN_IMG_SIZE: 384
  WEIGHTS: "swin_base_patch4_window12_384_22k.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]

mask2former/configs/cityscapes/semantic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_90k.yaml
ADDED
@@ -0,0 +1,18 @@
_BASE_: ../maskformer2_R50_bs16_90k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 192
    DEPTHS: [2, 2, 18, 2]
    NUM_HEADS: [6, 12, 24, 48]
    WINDOW_SIZE: 12
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
    PRETRAIN_IMG_SIZE: 384
  WEIGHTS: "swin_large_patch4_window12_384_22k.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  MASK_FORMER:
    NUM_OBJECT_QUERIES: 100

mask2former/configs/cityscapes/semantic-segmentation/swin/maskformer2_swin_small_bs16_90k.yaml
ADDED
@@ -0,0 +1,15 @@
_BASE_: ../maskformer2_R50_bs16_90k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 96
    DEPTHS: [2, 2, 18, 2]
    NUM_HEADS: [3, 6, 12, 24]
    WINDOW_SIZE: 7
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
  WEIGHTS: "swin_small_patch4_window7_224.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]

mask2former/configs/cityscapes/semantic-segmentation/swin/maskformer2_swin_tiny_bs16_90k.yaml
ADDED
@@ -0,0 +1,15 @@
_BASE_: ../maskformer2_R50_bs16_90k.yaml
MODEL:
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 96
    DEPTHS: [2, 2, 6, 2]
    NUM_HEADS: [3, 6, 12, 24]
    WINDOW_SIZE: 7
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
  WEIGHTS: "swin_tiny_patch4_window7_224.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]

mask2former/configs/coco/instance-segmentation/Base-COCO-InstanceSegmentation.yaml
ADDED
@@ -0,0 +1,47 @@
MODEL:
  BACKBONE:
    FREEZE_AT: 0
    NAME: "build_resnet_backbone"
  WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  RESNETS:
    DEPTH: 50
    STEM_TYPE: "basic" # not used
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
    # NORM: "SyncBN"
    RES5_MULTI_GRID: [1, 1, 1] # not used
DATASETS:
  TRAIN: ("coco_2017_train",)
  TEST: ("coco_2017_val",)
SOLVER:
  IMS_PER_BATCH: 16
  BASE_LR: 0.0001
  STEPS: (327778, 355092)
  MAX_ITER: 368750
  WARMUP_FACTOR: 1.0
  WARMUP_ITERS: 10
  WEIGHT_DECAY: 0.05
  OPTIMIZER: "ADAMW"
  BACKBONE_MULTIPLIER: 0.1
  CLIP_GRADIENTS:
    ENABLED: True
    CLIP_TYPE: "full_model"
    CLIP_VALUE: 0.01
    NORM_TYPE: 2.0
  AMP:
    ENABLED: True
INPUT:
  IMAGE_SIZE: 1024
  MIN_SCALE: 0.1
  MAX_SCALE: 2.0
  FORMAT: "RGB"
  DATASET_MAPPER_NAME: "coco_instance_lsj"
TEST:
  EVAL_PERIOD: 5000
DATALOADER:
  FILTER_EMPTY_ANNOTATIONS: True
  NUM_WORKERS: 4
VERSION: 2
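
Unlike the ADE20K and Cityscapes bases, the COCO base trains with large-scale jitter (LSJ): each image is resized by a random factor between MIN_SCALE and MAX_SCALE relative to IMAGE_SIZE and then cropped or padded to a fixed 1024 x 1024 canvas, and the schedule (368750 iterations at 16 images per batch) works out to roughly 50 epochs over the ~118k COCO train images. With detectron2's transform library the resize-and-crop part of that mapper looks roughly like this (a sketch of the augmentation list, not the full "coco_instance_lsj" dataset mapper):

import detectron2.data.transforms as T

image_size, min_scale, max_scale = 1024, 0.1, 2.0   # INPUT.IMAGE_SIZE / MIN_SCALE / MAX_SCALE

lsj_augmentations = [
    T.ResizeScale(
        min_scale=min_scale, max_scale=max_scale,
        target_height=image_size, target_width=image_size,
    ),
    T.FixedSizeCrop(crop_size=(image_size, image_size)),  # pads smaller results up to 1024x1024
]
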
mask2former/configs/coco/instance-segmentation/maskformer2_R101_bs16_50ep.yaml
ADDED
@@ -0,0 +1,11 @@
_BASE_: maskformer2_R50_bs16_50ep.yaml
MODEL:
  WEIGHTS: "R-101.pkl"
  RESNETS:
    DEPTH: 101
    STEM_TYPE: "basic" # not used
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
    # NORM: "SyncBN"
    RES5_MULTI_GRID: [1, 1, 1] # not used