Improving 2D Feature Representations by 3D-Aware Fine-Tuning
Abstract
Current visual foundation models are trained purely on unstructured 2D data, limiting their understanding of the 3D structure of objects and scenes. In this work, we show that fine-tuning on 3D-aware data improves the quality of the emergent semantic features. We design a method to lift semantic 2D features into an efficient 3D Gaussian representation, which allows us to re-render them from arbitrary views. Using the rendered 3D-aware features, we design a fine-tuning strategy that transfers this 3D awareness into a 2D foundation model. We demonstrate that models fine-tuned in this way produce features that readily improve downstream performance on semantic segmentation and depth estimation through simple linear probing. Notably, although fine-tuned on a single indoor dataset, the improvement transfers to a variety of indoor datasets as well as out-of-domain datasets. We hope our study encourages the community to consider injecting 3D awareness when training 2D foundation models. Project page: https://ywyue.github.io/FiT3D.
Community
TL;DR: We propose 3D-aware fine-tuning to improve 2D foundation features. Our method first lifts 2D image features (e.g., DINOv2) into a 3D feature Gaussian representation, then fine-tunes the 2D foundation model using the rendered 3D-aware features. We demonstrate that incorporating the fine-tuned features improves performance on downstream tasks such as semantic segmentation and depth estimation across a variety of datasets with simple linear probing; a minimal probing sketch follows below.
Project page: https://ywyue.github.io/FiT3D/
Code: https://github.com/ywyue/FiT3D
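As a rough illustration of the linear-probing evaluation described above (not the authors' code), the sketch below runs a frozen vanilla DINOv2 backbone from Transformers as a stand-in for the fine-tuned FiT3D weights and attaches a single linear layer that predicts per-patch semantic labels; the class count and input image are placeholders.

```python
# Minimal linear-probing sketch: frozen 2D backbone + one linear layer on patch tokens.
import numpy as np
import torch
import torch.nn as nn
from transformers import AutoImageProcessor, Dinov2Model

NUM_CLASSES = 21  # hypothetical number of semantic classes

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
backbone = Dinov2Model.from_pretrained("facebook/dinov2-base").eval()
for p in backbone.parameters():
    p.requires_grad_(False)  # the backbone stays frozen; only the probe is trained

probe = nn.Linear(backbone.config.hidden_size, NUM_CLASSES)

image = np.random.randint(0, 256, (518, 518, 3), dtype=np.uint8)  # stand-in for a real RGB image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    tokens = backbone(**inputs).last_hidden_state  # (1, 1 + N_patches, C)
patch_feats = tokens[:, 1:]                        # drop the CLS token

logits = probe(patch_feats)  # (1, N_patches, NUM_CLASSES); reshape/upsample to pixels downstream
```

The same frozen-backbone probe could regress depth values instead of class logits to reproduce the depth-estimation setting.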
Hi @yuanwenyue ,
Congrats on this work! Really interesting.
I saw the models are linked to the paper, which is great :) However, we usually recommend uploading each model to a separate model repository, following this guide: https://huggingface.co/docs/hub/models-uploading#upload-a-pytorch-model-using-huggingfacehub. This way, each model repo contains a config.json along with safetensors weights.
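For reference, here is a hedged sketch of the huggingface_hub flow from that guide, using PyTorchModelHubMixin; the model class, its placeholder architecture, and the repo id are hypothetical.

```python
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin

class FiT3DBackbone(nn.Module, PyTorchModelHubMixin):
    """Placeholder wrapper; the real fine-tuned ViT would go here."""
    def __init__(self, hidden_size: int = 768, num_layers: int = 12):
        super().__init__()
        self.encoder = nn.Sequential(
            *[nn.Linear(hidden_size, hidden_size) for _ in range(num_layers)]
        )

    def forward(self, x):
        return self.encoder(x)

model = FiT3DBackbone()
# Creates a model repo containing config.json (the __init__ kwargs)
# and model.safetensors, as recommended above.
model.push_to_hub("your-username/fit3d-dinov2-base")

# Anyone can then reload it directly:
reloaded = FiT3DBackbone.from_pretrained("your-username/fit3d-dinov2-base")
```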
Btw, since models like DINOv2, CLIP and MAE are available in the Transformers library, feel free to convert your checkpoints to their Transformers counterparts :) This can be achieved using their respective conversion scripts, e.g. this one for DINOv2: https://github.com/huggingface/transformers/blob/main/src/transformers/models/dinov2/convert_dinov2_to_hf.py.
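Once converted, the weights would load like any other Transformers checkpoint (the repo id below is just an illustrative placeholder):

```python
from transformers import AutoImageProcessor, Dinov2Model

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = Dinov2Model.from_pretrained("your-username/fit3d-dinov2-base-hf")  # hypothetical converted repo
```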
Kind regards,
Niels
Open-source @ HF
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features (2024)
- Query-based Semantic Gaussian Field for Scene Representation in Reinforcement Learning (2024)
- 4D Contrastive Superflows are Dense 3D Representation Learners (2024)
- Accessing Vision Foundation Models at ImageNet-level Costs (2024)
- OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding (2024)