---
license: creativeml-openrail-m
datasets:
- laion/laion400m
tags:
- stable-diffusion
- stable-diffusion-diffusers
- text-to-image
language:
- en
pipeline_tag: text-to-3d
---
# LDM3D-VR model
The LDM3D-VR model was proposed in ["LDM3D-VR: Latent Diffusion Model for 3D VR"](https://arxiv.org/pdf/2311.03226.pdf) by Gabriela Ben Melech Stan, Diana Wofk, Estelle Aflalo, Shao-Yen Tseng, Zhipeng Cai, Michael Paulitsch, and Vasudev Lal.
LDM3D-VR was accepted at the [NeurIPS 2023 Workshop on Diffusion Models](https://neurips.cc/virtual/2023/workshop/66539).
This new checkpoint is related to the upscaler, LDM3D-SR.
# Model description
The abstract from the paper is the following:

Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods.

<font size="2">LDM3D overview taken from [the original paper](https://arxiv.org/abs/2305.10853)</font>
### How to use
Here is how to use this model to generate a panoramic RGB image and its corresponding depth map from a text prompt in PyTorch:
```python
from diffusers import StableDiffusionLDM3DPipeline

# Load the panoramic LDM3D pipeline and move it to the GPU
pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d-pano")
pipe.to("cuda")

prompt = "360 view of a large bedroom"
name = "bedroom_pano"

# Generate the panoramic RGB image and its depth map from the text prompt
output = pipe(
    prompt,
    width=1024,
    height=512,
    guidance_scale=7.0,
    num_inference_steps=50,
)
rgb_image, depth_image = output.rgb, output.depth
rgb_image[0].save(name + "_ldm3d_rgb.jpg")
depth_image[0].save(name + "_ldm3d_depth.png")
```
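The saved depth map encodes relative (not metric) depth. Below is a minimal sketch of reading it back for downstream use, assuming the file produced above is a single-channel 16-bit PNG and normalizing it to [0, 1] (the normalization choice is illustrative, not prescribed by the model):

```python
import numpy as np
from PIL import Image

# Load the depth PNG saved by the pipeline above
# (assumption: 16-bit single-channel image, values in 0-65535).
depth_png = Image.open("bedroom_pano_ldm3d_depth.png")

# Convert to float and normalize to [0, 1]; values are relative depth,
# not metric distances.
depth = np.asarray(depth_png, dtype=np.float32) / 65535.0
print(depth.shape, depth.min(), depth.max())
```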

### Finetuning
This checkpoint fine-tunes the previous [ldm3d-4c](https://huggingface.co/Intel/ldm3d-4c) checkpoint on two panoramic-image datasets:
- [polyhaven](https://polyhaven.com/): 585 images for the training set, 66 images for the validation set
- [ihdri](https://www.ihdri.com/hdri-skies-outdoor/): 57 outdoor images for the training set, 7 outdoor images for the validation set.
These datasets were augmented using [Text2Light](https://frozenburning.github.io/projects/text2light/) to create a dataset containing 13,852 training samples and 1,606 validation samples.
The depth maps for these samples were generated with [DPT-large](https://github.com/isl-org/MiDaS), and the captions with [BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2); a sketch of this preprocessing step is shown below.
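The following is a minimal sketch of that preprocessing step using the Hugging Face `transformers` library. The use of the `depth-estimation` pipeline with `Intel/dpt-large` and of the `Salesforce/blip2-opt-2.7b` BLIP-2 checkpoint are assumptions for illustration; the exact models and inference settings used to build the dataset may differ:

```python
import torch
from PIL import Image
from transformers import pipeline, Blip2Processor, Blip2ForConditionalGeneration

# Depth estimation with DPT-large (as referenced above); pipeline settings are illustrative.
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large", device=0)

# Captioning with BLIP-2; the blip2-opt-2.7b checkpoint is an assumption.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("panorama.jpg")  # hypothetical panoramic training image

# Predicted depth map as a PIL image
depth_map = depth_estimator(image)["depth"]
depth_map.save("panorama_depth.png")

# Generated caption for the same image
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
caption_ids = blip2.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip()
print(caption)
```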
### BibTeX entry and citation info
    @misc{stan2023ldm3dvr,
        title={LDM3D-VR: Latent Diffusion Model for 3D VR},
        author={Gabriela Ben Melech Stan and Diana Wofk and Estelle Aflalo and Shao-Yen Tseng and Zhipeng Cai and Michael Paulitsch and Vasudev Lal},
        year={2023},
        eprint={2311.03226},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
    }