Update README.md #3
by zmadscientist - opened

README.md CHANGED
@@ -23,7 +23,7 @@ model-index:
# Overview of Monocular depth estimation and BEiT
Monocular depth estimation, which aims to infer detailed depth from a single image or camera view, finds applications in fields such as generative AI, 3D reconstruction, and autonomous driving. Deriving depth from the pixels of a single image is difficult, however, because the problem is underconstrained. Much of the recent progress comes from learning-based methods, in particular MiDaS, which leverages dataset mixing and a scale-and-shift-invariant loss. MiDaS has evolved through releases with more powerful backbones as well as lightweight variants for mobile use. With the rise of transformer architectures in computer vision, pioneered by models such as ViT, there has been a shift toward using them for depth estimation. Building on this, MiDaS v3.1 incorporates promising transformer-based encoders alongside traditional convolutional ones, aiming for a comprehensive investigation of depth estimation techniques. The paper describes how these backbones are integrated into MiDaS, compares the different v3.1 models, and offers guidance on using future backbones with MiDaS.

-| Input Image | Output Depth Image |
+| Input Image (images.cocodataset.org/val2017/000000039769.jpg) | Output Depth Image |
| --- | --- |
| ![input image](https://cdn-uploads.huggingface.co/production/uploads/63dc702662dc193e6d460f1b/PDwRwuryaO3YtuyRjraiM.jpeg) | ![Depth image](https://cdn-uploads.huggingface.co/production/uploads/63dc702662dc193e6d460f1b/ugqri6LcqJBuU9zI9aeqN.jpeg) |

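The depth map shown above can be produced with the `transformers` depth-estimation pipeline that the later hunks reference via `result["depth"]`. The sketch below is a minimal illustration rather than code taken from this README, and the checkpoint id `Intel/dpt-beit-large-512` is an assumption; substitute the id shown on this model card.

```python
# Minimal sketch: run monocular depth estimation on the COCO image referenced above.
# The checkpoint id is an assumption; replace it with the id of this model card.
from transformers import pipeline

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-beit-large-512")

# The pipeline accepts an image URL, a local file path, or a PIL image.
result = depth_estimator("http://images.cocodataset.org/val2017/000000039769.jpg")

# result["depth"] is a PIL image of the predicted depth map;
# result["predicted_depth"] is the raw tensor output.
result["depth"].save("depth.png")
```
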
@@ -136,6 +136,9 @@ result["depth"]
```

## Quantitative Analyses
+
+See the paper for a more in-depth analysis: https://arxiv.org/pdf/2307.14460.pdf
+
| Model | Square Resolution HRWSI RMSE | Square Resolution Blended MVS REL | Square Resolution ReDWeb RMSE |
| --- | --- | --- | --- |
| BEiT 384-L | 0.068 | 0.070 | 0.076 |
@@ -155,6 +158,31 @@ result["depth"]
| ViT-L Reversed | 0.071 | 0.073 | 0.081 |
| Swin-L Equidistant | 0.072 | 0.074 | 0.083 |
| --- | --- | --- | --- |
+
+### Model evaluation (post first training stage)
+
+The table above shows the validation of unpublished models that were mostly trained only in the first training stage, and not also the second, due to low depth estimation quality (see Sec. 3.3). The models above the horizontal separator line (between Next-ViT-L-1K-6M and DeiT3-L-22K-1K) are included for comparison with the other models and have at least one released variant in Tab. 1, although they were not released directly themselves (see Sec. 4.2 for details). For Swin-L, two different training runs are shown. The models above the dashed separator are based on transformer backbones, the models between the dashed and dotted lines are convolutional ones, and the rows below the dotted separator are models with experimental modifications as explained in Sec. 4.3. All models in this table are trained on the 3+10 dataset configuration (in contrast to the mixtures of Tabs. 1 and 2). Validation is done on the HRWSI [48], BlendedMVS [50] and ReDWeb [24] datasets. The errors used for validation are the root mean square error of the disparity (RMSE) and the mean absolute value of the relative error (REL); see Sec. 4.1. Note that DeiT3-L-22K-1K is DeiT3-L pretrained on ImageNet-22K and fine-tuned on ImageNet-1K, Next-ViT-L-1K is short for Next-ViT-L ImageNet-1K, and Next-ViT-L-1K-6M stands for Next-ViT-L ImageNet-1K-6M. The model in italics is a retrained legacy model from MiDaS v3.0. The rows are ordered so that better models are at the top; the best numbers per column are bold and the second best underlined.
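The two validation errors named in the caption can be written out explicitly. The sketch below is an editorial illustration, not code from the MiDaS repository, and it assumes the prediction and ground truth are already aligned disparity maps (the scale-and-shift alignment described in the paper is omitted).

```python
# Illustrative definitions of the two validation errors named above,
# assuming aligned disparity maps of the same shape.
import numpy as np

def rmse(pred_disp: np.ndarray, gt_disp: np.ndarray) -> float:
    """Root mean square error of the disparity."""
    return float(np.sqrt(np.mean((pred_disp - gt_disp) ** 2)))

def rel(pred_disp: np.ndarray, gt_disp: np.ndarray) -> float:
    """Mean absolute value of the relative error: mean(|pred - gt| / gt)."""
    return float(np.mean(np.abs(pred_disp - gt_disp) / gt_disp))
```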

### BibTeX entry and citation info
