Add metadata and hf_hub_download (#1)
- Add metadata and hf_hub_download (efedecd9dde07ac032d70e8a56fe5e848a588810)
Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>
README.md
CHANGED
@@ -1,238 +1,243 @@
[Project Page](https://nerf-mae.github.io)
[PyTorch](https://pytorch.org/)
[Citation](https://github.com/zubair-irshad/NeRF-MAE?tab=readme-ov-file#citation)
[Video](https://youtu.be/D60hlhmeuJI?si=d4RfHAwBJgLJXdKj)

</div>

---

<a href="https://www.tri.global/" target="_blank">
<img align="right" src="demo/GeorgiaTech_RGB.png" width="18%"/>
</a>

<a href="https://www.tri.global/" target="_blank">
<img align="right" src="demo/tri-logo.png" width="17%"/>
</a>

### [Project Page](https://nerf-mae.github.io/) | [arXiv](https://arxiv.org/abs/2404.01300) | [PDF](https://arxiv.org/pdf/2404.01300.pdf)

**NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields**

<a href="https://zubairirshad.com"><strong>Muhammad Zubair Irshad</strong></a>
·
<a href="https://zakharos.github.io/"><strong>Sergey Zakharov</strong></a>
·
<a href="https://www.linkedin.com/in/vitorguizilini"><strong>Vitor Guizilini</strong></a>
·
<a href="https://adriengaidon.com/"><strong>Adrien Gaidon</strong></a>
·
<a href="https://faculty.cc.gatech.edu/~zk15/"><strong>Zsolt Kira</strong></a>
·
<a href="https://www.tri.global/about-us/dr-rares-ambrus"><strong>Rares Ambrus</strong></a>
<br> **European Conference on Computer Vision, ECCV 2024**<br>

<b>Toyota Research Institute | Georgia Institute of Technology</b>

## 💡 Highlights
- **NeRF-MAE**: The first large-scale pretraining utilizing Neural Radiance Fields (NeRF) as an input modality. We pretrain a single Transformer model on thousands of NeRFs for 3D representation learning.
- **NeRF-MAE Dataset**: A large-scale NeRF pretraining and downstream task finetuning dataset.

## 🏷️ TODO 🚀

- [x] Release large-scale pretraining code 🚀
- [x] Release NeRF-MAE dataset comprising radiance and density grids 🚀
- [x] Release 3D object detection finetuning and eval code 🚀
- [x] Pretrained NeRF-MAE checkpoints and out-of-the-box model usage 🚀

## NeRF-MAE Model Architecture
<p align="center">
<img src="demo/nerf-mae_architecture.jpg" width="90%">
</p>


## Citation

If you find this repository or our dataset useful, please star ⭐ this repository and consider citing 📝:

```
@inproceedings{irshad2024nerfmae,
  title={NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields},
  author={Muhammad Zubair Irshad and Sergey Zakharov and Vitor Guizilini and Adrien Gaidon and Zsolt Kira and Rares Ambrus},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
```

### Contents
- [🌇 Environment](#-environment)
- [⛳ Model Usage and Checkpoints](#-model-usage-and-checkpoints)
- [🗂️ Dataset](#-dataset)

## 🌇 Environment

Create a Python 3.9 virtual environment and install the requirements:

```bash
cd NeRF-MAE  # navigate to the repository root
conda create -n nerf-mae python=3.9
conda activate nerf-mae
pip install --upgrade pip
pip install -r requirements.txt
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
```

The code was built and tested with **CUDA 11.3**.
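
If you want to confirm that the installed wheels match this setup, a quick check like the following (a minimal sketch, not part of the codebase) prints the PyTorch build and the CUDA version it was compiled against:

```python
import torch

# Print the installed PyTorch build and the CUDA toolkit it was compiled against.
print("torch:", torch.__version__)                # expected: 1.12.1+cu113
print("compiled with CUDA:", torch.version.cuda)  # expected: 11.3
print("GPU available:", torch.cuda.is_available())
```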

Compile the CUDA extension required for downstream-task finetuning, as described in [NeRF-RPN](https://github.com/lyclyc52/NeRF_RPN):

```bash
cd NeRF-MAE  # repository root
cd nerf_rpn/model/rotated_iou/cuda_op
python setup.py install
cd ../../../..
```

## ⛳ Model Usage and Checkpoints

- [Hugging Face repo to download pretrained and finetuned checkpoints](https://huggingface.co/mirshad7/NeRF-MAE)

NeRF-MAE is structured to provide easy access to pretrained NeRF-MAE models (and reproductions), to facilitate their use in various downstream tasks. This is useful for extracting good visual features from NeRFs if you don't have the resources for large-scale pretraining. Our pretraining provides an easy-to-access embedding of any NeRF scene, which can be used for a variety of downstream tasks in a straightforward way.

We have released pretrained and finetuned checkpoints so you can use our codebase out of the box. There are two typical usages: 1. the most common is using the features directly in a downstream task, such as an FPN head for 3D object detection, and 2. reconstructing the original grid to enforce losses such as a masked reconstruction loss. Below is a sample usage of our model with spelled-out comments.


1. Get the features to be used in a downstream task

```python
import torch
from huggingface_hub import hf_hub_download

# SwinTransformer_MAE3D_New is the NeRF-MAE backbone class provided in this repository.

# Define Swin Transformer configurations
swin_config = {
    "swin_t": {"embed_dim": 96, "depths": [2, 2, 6, 2], "num_heads": [3, 6, 12, 24]},
    "swin_s": {"embed_dim": 96, "depths": [2, 2, 18, 2], "num_heads": [3, 6, 12, 24]},
    "swin_b": {"embed_dim": 128, "depths": [2, 2, 18, 2], "num_heads": [3, 6, 12, 24]},
    "swin_l": {"embed_dim": 192, "depths": [2, 2, 18, 2], "num_heads": [6, 12, 24, 48]},
}

# Set the desired backbone type
backbone_type = "swin_s"
config = swin_config[backbone_type]

# Sample input: an RGB + density grid of shape (1, 4, 160, 160, 160) at resolution 160
resolution = 160
input_grid = torch.randn((1, 4, resolution, resolution, resolution))

# Initialize Swin Transformer model
model = SwinTransformer_MAE3D_New(
    patch_size=[4, 4, 4],
    embed_dim=config["embed_dim"],
    depths=config["depths"],
    num_heads=config["num_heads"],
    window_size=[4, 4, 4],
    stochastic_depth_prob=0.1,
    expand_dim=True,
    resolution=resolution,
)

# Load checkpoint and remove unused (decoder) layers
checkpoint_path = hf_hub_download(repo_id="mirshad7/NeRF-MAE", filename="nerf_mae_pretrained.pt")
checkpoint = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
for attr in ["decoder4", "decoder3", "decoder2", "decoder1", "out", "mask_token"]:
    delattr(model, attr)

# Extract multi-scale features using the Swin Transformer backbone
features = []
input_grid = model.patch_partition(input_grid) + model.pos_embed.type_as(input_grid).to(input_grid.device).clone().detach()
for stage in model.stages:
    input_grid = stage(input_grid)
    features.append(torch.permute(input_grid, [0, 4, 1, 2, 3]).contiguous())  # Format: [N, C, H, W, D]

# Multi-scale features have shapes:
# [torch.Size([1, 96, 40, 40, 40]), torch.Size([1, 192, 20, 20, 20]),
#  torch.Size([1, 384, 10, 10, 10]), torch.Size([1, 768, 5, 5, 5])]

# Process features through an FPN (see the sketch below)
```
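
For reference, the sketch below shows one way the four multi-scale feature volumes could be consumed by an FPN-style neck. ```SimpleFPN3D``` is a hypothetical, illustrative module (not the FPN head shipped with this repo) and assumes the `features` list and shapes from the block above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN3D(nn.Module):
    """Illustrative 3D FPN neck: lateral 1x1x1 convs plus a top-down pathway."""

    def __init__(self, in_channels=(96, 192, 384, 768), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv3d(c, out_channels, kernel_size=1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv3d(out_channels, out_channels, kernel_size=3, padding=1) for _ in in_channels])

    def forward(self, feats):
        # feats: list of [N, C, H, W, D] volumes, finest first (40^3 ... 5^3).
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # Top-down pathway: upsample each coarser map and add it to the next finer lateral.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[2:], mode="trilinear", align_corners=False
            )
        return [smooth(l) for smooth, l in zip(self.smooth, laterals)]

fpn = SimpleFPN3D()
pyramid = fpn(features)  # e.g. [1, 256, 40, 40, 40], ..., [1, 256, 5, 5, 5]
```

The resulting pyramid can then be fed to a detection or segmentation head of your choice.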

2. Get the Original Grid Output

```python
import torch

# load_data and build_model are helpers from this repo's run scripts; `args` holds the script arguments.

# Load data from the specified folder and filename at the given resolution.
# rgbsigma has sample shape (1, 4, 160, 160, 160).
res, rgbsigma = load_data(folder_name, filename, resolution=args.resolution)

# Build the model using the provided arguments.
model = build_model(args)

# Load the checkpoint if provided.
if args.checkpoint:
    model.load_state_dict(torch.load(args.checkpoint, map_location="cpu")["state_dict"])
    model.eval()  # Set model to evaluation mode.

# Run inference to get the reconstructed grid predictions for downstream usage.
with torch.no_grad():
    pred = model([rgbsigma], is_eval=True)[3]  # Extract only the predictions.
```
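
If you instead want to enforce a reconstruction-style objective on the predicted grid (e.g., the masked reconstruction loss mentioned above), the sketch below shows the general pattern. The random voxel mask and the MSE weighting are illustrative assumptions, not the exact loss used for pretraining in this repo:

```python
import torch
import torch.nn.functional as F

# Illustrative tensors standing in for the model output and the input RGB + density grid.
pred = torch.randn(1, 4, 160, 160, 160)
rgbsigma = torch.randn(1, 4, 160, 160, 160)

# Hypothetical random voxel mask: 1 where the input was masked out, 0 where it was visible.
mask = (torch.rand(1, 1, 160, 160, 160) < 0.75).float()

# Average the squared error over masked voxels only.
per_voxel = F.mse_loss(pred, rgbsigma, reduction="none") * mask
recon_loss = per_voxel.sum() / (mask.sum() * pred.shape[1]).clamp(min=1)
print(recon_loss.item())
```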

### 1. How to plug these features into downstream 3D bounding box detection from NeRFs (i.e., plug-and-play with a [NeRF-RPN](https://github.com/lyclyc52/NeRF_RPN) OBB prediction head)

Please also see the section on [Finetuning](#-finetuning). Our released finetuned checkpoint achieves state-of-the-art 3D object detection in NeRFs. To run evaluation using our finetuned checkpoint on the dataset provided by NeRF-RPN, run the script below after updating the path to the pretrained checkpoint (i.e. ```--checkpoint```) and ```DATA_ROOT```, depending on whether evaluation is done for ```Front3D``` or ```Scannet```:

```bash
bash test_fcos_pretrained.sh
```

Also see the corresponding run file, i.e. ```run_fcos_pretrained.py```, and our model adaptation, i.e. ```SwinTransformer_FPN_Pretrained_Skip```. This is a minimal adaptation to plug and play our weights with a NeRF-RPN architecture and achieve a significant boost in performance.
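
As a rough illustration of what such an adaptation involves (this is not the code in ```run_fcos_pretrained.py```), transferring the pretrained weights into a NeRF-RPN-style detector usually means keeping only the encoder parameters and loading them non-strictly; the key-name filters below mirror the attribute names removed in the usage example above and may need adjusting:

```python
import torch
from huggingface_hub import hf_hub_download

# Download the pretrained checkpoint and drop decoder / output / mask-token weights,
# which are not needed when the backbone feeds a detection head.
checkpoint_path = hf_hub_download(repo_id="mirshad7/NeRF-MAE", filename="nerf_mae_pretrained.pt")
state_dict = torch.load(checkpoint_path, map_location="cpu")["state_dict"]
backbone_state = {
    k: v for k, v in state_dict.items()
    if not k.startswith(("decoder", "out", "mask_token"))
}
print(f"kept {len(backbone_state)} of {len(state_dict)} tensors")

# A detector such as SwinTransformer_FPN_Pretrained_Skip can then load these weights with
# detector.load_state_dict(backbone_state, strict=False), so its detection-specific heads
# keep their fresh initialization while the backbone starts from NeRF-MAE weights.
```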


## 🗂️ Dataset

Download the preprocessed datasets from the links below.

- Pretraining dataset (comprising NeRF radiance and density grids). [Download link](https://s3.amazonaws.com/tri-ml-public.s3.amazonaws.com/github/nerfmae/NeRF-MAE_pretrain.tar.gz)
- Finetuning dataset (comprising NeRF radiance and density grids and bounding box/semantic labelling annotations). [3D Object Detection (Provided by NeRF-RPN)](https://drive.google.com/drive/folders/1q2wwLi6tSXu1hbEkMyfAKKdEEGQKT6pj), [3D Semantic Segmentation (Coming Soon)](), [Voxel Super-Resolution (Coming Soon)]()


Extract the pretraining and finetuning datasets under ```NeRF-MAE/datasets```. The directory structure should look like this:

```
NeRF-MAE
├── pretrain
│   ├── features
│   └── nerfmae_split.npz
└── finetune
    └── front3d_rpn_data
        ├── features
        ├── aabb
        └── obb
```
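
As a quick sanity check after extraction, the pretraining split file can be inspected with NumPy; the array names inside it are not documented here, so the snippet below (a minimal sketch, assuming the layout above) simply lists whatever keys the archive contains:

```python
import numpy as np

# Path relative to the extracted dataset root shown in the tree above.
split = np.load("pretrain/nerfmae_split.npz", allow_pickle=True)
print("arrays in nerfmae_split.npz:", split.files)
for name in split.files:
    print(name, getattr(split[name], "shape", None))
```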

Note: The above datasets are all you need to train and evaluate our method. Bonus: we will soon be releasing our multi-view rendered posed RGB images from FRONT3D, HM3D, and Hypersim, as well as Instant-NGP trained checkpoints (these comprise over 1M images and 3k+ NeRF checkpoints).

Please note that our dataset was generated using the instructions from [NeRF-RPN](https://github.com/lyclyc52/NeRF_RPN) and [3D-CLR](https://vis-www.cs.umass.edu/3d-clr/). Please consider citing our work, NeRF-RPN, and 3D-CLR if you find this dataset useful in your research.

Please also note that our dataset uses [Front3D](https://arxiv.org/abs/2011.09127), [Habitat-Matterport3D](https://arxiv.org/abs/2109.08238), [HyperSim](https://github.com/apple/ml-hypersim) and [ScanNet](https://www.scan-net.org/) as the base datasets, i.e. we train a NeRF per scene and extract radiance and density grids as well as aligned NeRF-grid 3D annotations. Please read the terms of use for each dataset if you want to utilize its posed multi-view images.

### For more details, please check out our Paper, GitHub, and Project Page!

---
license: cc-by-nc-4.0
---