Improving 2D Feature Representations by 3D-Aware Fine-Tuning
Abstract
Current visual foundation models are trained purely on unstructured 2D data, limiting their understanding of the 3D structure of objects and scenes. In this work, we show that fine-tuning on 3D-aware data improves the quality of the emergent semantic features. We design a method to lift semantic 2D features into an efficient 3D Gaussian representation, which allows us to re-render them from arbitrary views. Using the rendered 3D-aware features, we design a fine-tuning strategy that transfers this 3D awareness into a 2D foundation model. We demonstrate that models fine-tuned in this way produce features that readily improve downstream performance on semantic segmentation and depth estimation through simple linear probing. Notably, although fine-tuned on a single indoor dataset, the improvement transfers to a variety of indoor datasets as well as out-of-domain datasets. We hope our study encourages the community to consider injecting 3D awareness when training 2D foundation models. Project page: https://ywyue.github.io/FiT3D.
Community
TL;DR: We propose 3D-aware fine-tuning to improve 2D foundation features. Our method first lifts 2D image features (e.g., DINOv2) into a 3D feature Gaussian representation, then fine-tunes the 2D foundation model using the rendered 3D-aware features. We demonstrate that incorporating the fine-tuned features improves performance on downstream tasks such as semantic segmentation and depth estimation across a variety of datasets with simple linear probing; a minimal probing sketch follows below.
Project page: https://ywyue.github.io/FiT3D/
Code: https://github.com/ywyue/FiT3D
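As a rough illustration of the linear-probing evaluation described above (not the authors' code), the sketch below runs a frozen vanilla DINOv2 backbone from Transformers as a stand-in for the fine-tuned FiT3D weights and attaches a single linear layer that predicts per-patch semantic labels; the class count and input image are placeholders.

```python
# Minimal linear-probing sketch: frozen 2D backbone + one linear layer on patch tokens.
import numpy as np
import torch
import torch.nn as nn
from transformers import AutoImageProcessor, Dinov2Model

NUM_CLASSES = 21  # hypothetical number of semantic classes

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
backbone = Dinov2Model.from_pretrained("facebook/dinov2-base").eval()
for p in backbone.parameters():
    p.requires_grad_(False)  # the backbone stays frozen; only the probe is trained

probe = nn.Linear(backbone.config.hidden_size, NUM_CLASSES)

image = np.random.randint(0, 256, (518, 518, 3), dtype=np.uint8)  # stand-in for a real RGB image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    tokens = backbone(**inputs).last_hidden_state  # (1, 1 + N_patches, C)
patch_feats = tokens[:, 1:]                        # drop the CLS token

logits = probe(patch_feats)  # (1, N_patches, NUM_CLASSES); reshape/upsample to pixels downstream
```

The same frozen-backbone probe could regress depth values instead of class logits to reproduce the depth-estimation setting.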
Hi @yuanwenyue ,
Congrats on this work! Really interesting.
I saw the models are linked to the paper, which is great :) However, we usually recommend uploading each model to a separate model repository, following this guide: https://huggingface.co/docs/hub/models-uploading#upload-a-pytorch-model-using-huggingfacehub. This way, each model repo contains a config.json along with safetensors weights.
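For reference, here is a hedged sketch of the huggingface_hub flow from that guide, using PyTorchModelHubMixin; the model class, its placeholder architecture, and the repo id are hypothetical.

```python
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin

class FiT3DBackbone(nn.Module, PyTorchModelHubMixin):
    """Placeholder wrapper; the real fine-tuned ViT would go here."""
    def __init__(self, hidden_size: int = 768, num_layers: int = 12):
        super().__init__()
        self.encoder = nn.Sequential(
            *[nn.Linear(hidden_size, hidden_size) for _ in range(num_layers)]
        )

    def forward(self, x):
        return self.encoder(x)

model = FiT3DBackbone()
# Creates a model repo containing config.json (the __init__ kwargs)
# and model.safetensors, as recommended above.
model.push_to_hub("your-username/fit3d-dinov2-base")

# Anyone can then reload it directly:
reloaded = FiT3DBackbone.from_pretrained("your-username/fit3d-dinov2-base")
```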
Btw, since models like DINOv2, CLIP and MAE are available in the Transformers library, feel free to convert your checkpoints to their Transformers counterparts :) This can be achieved using their respective conversion scripts, e.g. this one for DINOv2: https://github.com/huggingface/transformers/blob/main/src/transformers/models/dinov2/convert_dinov2_to_hf.py.
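Once converted, the weights would load like any other Transformers checkpoint (the repo id below is just an illustrative placeholder):

```python
from transformers import AutoImageProcessor, Dinov2Model

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = Dinov2Model.from_pretrained("your-username/fit3d-dinov2-base-hf")  # hypothetical converted repo
```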
Kind regards,
Niels
Open-source @ HF
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features (2024)
- Query-based Semantic Gaussian Field for Scene Representation in Reinforcement Learning (2024)
- 4D Contrastive Superflows are Dense 3D Representation Learners (2024)
- Accessing Vision Foundation Models at ImageNet-level Costs (2024)
- OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding (2024)