Files changed (1)
  1. README.md +29 -1
README.md CHANGED
@@ -23,7 +23,7 @@ model-index:
23
  # Overview of Monocular depth estimation and BEiT
24
  Monocular depth estimation aims to infer detailed depth from a single image or camera view and has applications in generative AI, 3D reconstruction, and autonomous driving. However, deriving depth from individual pixels in a single image is challenging because the problem is underconstrained. Recent progress is largely due to learning-based methods, particularly MiDaS, which leverages dataset mixing and a scale-and-shift-invariant loss. MiDaS has evolved through releases featuring more powerful backbones and lightweight variants for mobile use. With the rise of transformer architectures in computer vision, pioneered by models like ViT, there has been a shift towards using them for depth estimation. Inspired by this, MiDaS v3.1 incorporates promising transformer-based encoders alongside traditional convolutional ones, aiming for a comprehensive investigation of depth estimation techniques. The paper describes the integration of these backbones into MiDaS, provides a thorough comparison of the different v3.1 models, and offers guidance on using future backbones with MiDaS.
25
 
26
- | Input Image | Output Depth Image |
27
  | --- | --- |
28
  | ![input image](https://cdn-uploads.huggingface.co/production/uploads/63dc702662dc193e6d460f1b/PDwRwuryaO3YtuyRjraiM.jpeg) | ![Depth image](https://cdn-uploads.huggingface.co/production/uploads/63dc702662dc193e6d460f1b/ugqri6LcqJBuU9zI9aeqN.jpeg) |
29
 
@@ -136,6 +136,9 @@ result["depth"]
136
  ```
137
 
138
  ## Quantitative Analyses
139
  | Model | Square Resolution HRWSI RMSE | Square Resolution Blended MVS REL | Square Resolution ReDWeb RMSE |
140
  | --- | --- | --- | --- |
141
  | BEiT 384-L | 0.068 | 0.070 | 0.076 |
@@ -155,6 +158,31 @@ result["depth"]
155
  | ViT-L Reversed | 0.071 | 0.073 | 0.081 |
156
  | Swin-L Equidistant | 0.072 | 0.074 | 0.083 |
157
  | --- | --- | --- | --- |
158
 
159
  ### BibTeX entry and citation info
160
 
 
23
  # Overview of Monocular depth estimation and BEiT
24
  Monocular depth estimation aims to infer detailed depth from a single image or camera view and has applications in generative AI, 3D reconstruction, and autonomous driving. However, deriving depth from individual pixels in a single image is challenging because the problem is underconstrained. Recent progress is largely due to learning-based methods, particularly MiDaS, which leverages dataset mixing and a scale-and-shift-invariant loss. MiDaS has evolved through releases featuring more powerful backbones and lightweight variants for mobile use. With the rise of transformer architectures in computer vision, pioneered by models like ViT, there has been a shift towards using them for depth estimation. Inspired by this, MiDaS v3.1 incorporates promising transformer-based encoders alongside traditional convolutional ones, aiming for a comprehensive investigation of depth estimation techniques. The paper describes the integration of these backbones into MiDaS, provides a thorough comparison of the different v3.1 models, and offers guidance on using future backbones with MiDaS.
25
 
26
+ | Input Image (images.cocodataset.org/val2017/000000039769.jpg) | Output Depth Image |
27
  | --- | --- |
28
  | ![input image](https://cdn-uploads.huggingface.co/production/uploads/63dc702662dc193e6d460f1b/PDwRwuryaO3YtuyRjraiM.jpeg) | ![Depth image](https://cdn-uploads.huggingface.co/production/uploads/63dc702662dc193e6d460f1b/ugqri6LcqJBuU9zI9aeqN.jpeg) |
29
 
 
136
  ```
137
 
138
  ## Quantitative Analyses
139
+
140
+ See the paper for a more in-depth analysis: https://arxiv.org/pdf/2307.14460.pdf
141
+
142
  | Model | Square Resolution HRWSI RMSE | Square Resolution Blended MVS REL | Square Resolution ReDWeb RMSE |
143
  | --- | --- | --- | --- |
144
  | BEiT 384-L | 0.068 | 0.070 | 0.076 |
 
158
  | ViT-L Reversed | 0.071 | 0.073 | 0.081 |
159
  | Swin-L Equidistant | 0.072 | 0.074 | 0.083 |
160
  | --- | --- | --- | --- |
161
+ ### Model evaluation (post first training stage)
162
+
163
+ The table
164
+ shows the validation of unpublished models which were mostly
165
+ trained only in the first training stage and not also the second
166
+ one due to low depth estimation quality (see Sec. 3.3). The
167
+ models above the horizontal separator line (between Next-ViT-L-1K-6M and DeiT3-L-22K-1K) are included for a comparison
168
+ with the other models and have at least a released variant in
169
+ Tab. 1, although they were also not released directly (see Sec. 4.2
170
+ for details). For Swin-L, two different training runs are shown.
171
+ The models above the dashed separator are models based on
172
+ transformer backbones, and the models between the dashed and
173
+ dotted line are convolutional ones. The rows below the dotted
174
+ separator are models with experimental modifications as explained
175
+ in Sec. 4.3. All the models in this table are trained on the 3+10
176
+ dataset configuration (in contrast to the mixtures of Tabs. 1 and 2).
177
+ Validation is done on the datasets HRWSI [48], BlendedMVS [50]
178
+ and ReDWeb [24]. The errors used for validation are the root
179
+ mean square error of the disparity (RMSE) and the mean absolute
180
+ value of the relative error (REL), see Sec. 4.1. Note that
181
+ DeiT3-L-22K-1K is DeiT3-L pretrained on ImageNet-22k and fine-tuned
182
+ on ImageNet-1K, Next-ViT-L-1K is the shortened form of Next-ViT-L ImageNet-1K and Next-ViT-L-1K-6M stands for Next-ViT-L ImageNet-1K-6M. The model in italics is a retrained legacy
183
+ model from MiDaS v3.0. The rows are ordered such that better
184
+ models are at the top. The best numbers per column are bold and
185
+ second best underlined.
186
 
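The two error measures named above, RMSE and REL, can be illustrated with a minimal sketch. It is not the evaluation code used for the paper: the function names `rmse` and `rel` are hypothetical, the inputs are assumed to be flat sequences of already aligned prediction and ground-truth values, and any masking of invalid pixels or scale-and-shift alignment is assumed to have happened beforehand.

```python
import math
from typing import Sequence


def _check(pred: Sequence[float], target: Sequence[float]) -> None:
    # Both sequences must describe the same, non-empty set of pixels.
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root mean square error between predicted and reference values.

    For the RMSE columns the values would be disparities (assumption:
    already aligned to the ground truth).
    """
    _check(pred, target)
    squared = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared / len(pred))


def rel(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute value of the relative error, |pred - target| / target.

    Assumes every reference value is strictly positive.
    """
    _check(pred, target)
    if any(t <= 0 for t in target):
        raise ValueError("reference values must be strictly positive")
    return sum(abs(p - t) / t for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    prediction = [1.0, 3.0]
    reference = [2.0, 4.0]
    print(rmse(prediction, reference))  # both errors are 1.0, so RMSE is 1.0
    print(rel(prediction, reference))   # (0.5 + 0.25) / 2 = 0.375
```

Lower values are better for both measures, which matches the ordering described above (better models at the top).
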
187
  ### BibTeX entry and citation info
188