	<!--Copyright 2024 The HuggingFace Team. All rights reserved.

	Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
	an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
	specific language governing permissions and limitations under the License.
	-->

	# Kandinsky

	[[open-in-colab]]

	The Kandinsky models are a series of multilingual text-to-image generation models. The Kandinsky 2.0 model uses two multilingual text encoders and concatenates those results for the UNet.

	[Kandinsky 2.1](../api/pipelines/kandinsky) changes the architecture to include an image prior model ([`CLIP`](https://huggingface.co/docs/transformers/model_doc/clip)) to generate a mapping between text and image embeddings. The mapping provides better text-image alignment and it is used with the text embeddings during training, leading to higher quality results. Finally, Kandinsky 2.1 uses a [Modulating Quantized Vectors (MoVQ)](https://huggingface.co/papers/2209.09002) decoder - which adds a spatial conditional normalization layer to increase photorealism - to decode the latents into images.

	[Kandinsky 2.2](../api/pipelines/kandinsky_v22) improves on the previous model by replacing the image encoder of the image prior model with a larger CLIP-ViT-G model to improve quality. The image prior model was also retrained on images with different resolutions and aspect ratios to generate higher-resolution images and different image sizes.

	[Kandinsky 3](../api/pipelines/kandinsky3) simplifies the architecture and shifts away from the two-stage generation process involving the prior model and diffusion model. Instead, Kandinsky 3 uses [Flan-UL2](https://huggingface.co/google/flan-ul2) to encode text, a UNet with [BigGan-deep](https://hf.co/papers/1809.11096) blocks, and [Sber-MoVQGAN](https://github.com/ai-forever/MoVQGAN) to decode the latents into images. Text understanding and generated image quality are primarily achieved by using a larger text encoder and UNet.

	This guide will show you how to use the Kandinsky models for text-to-image, image-to-image, inpainting, interpolation, and more.

	Before you begin, make sure you have the following libraries installed:

	```py
	# uncomment to install the necessary libraries in Colab
	#!pip install -q diffusers transformers accelerate
	```

	<Tip warning={true}>

	Kandinsky 2.1 and 2.2 usage is very similar! The only difference is Kandinsky 2.2 doesn't accept `prompt` as an input when decoding the latents. Instead, Kandinsky 2.2 only accepts `image_embeds` during decoding.

	<br>

	Kandinsky 3 has a more concise architecture and it doesn't require a prior model. This means its usage is identical to other diffusion models like [Stable Diffusion XL](sdxl).

	</Tip>

	## Text-to-image

	To use the Kandinsky models for any task, you always start by setting up the prior pipeline to encode the prompt and generate the image embeddings. The prior pipeline also generates `negative_image_embeds` that correspond to the negative prompt `""`. For better results, you can pass an actual `negative_prompt` to the prior pipeline, but this'll increase the effective batch size of the prior pipeline by 2x.

	<hfoptions id="text-to-image">
	<hfoption id="Kandinsky 2.1">

	```py
	from diffusers import KandinskyPriorPipeline, KandinskyPipeline
	import torch

	prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16).to("cuda")
	pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16).to("cuda")

	prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
	negative_prompt = "low quality, bad quality" # optional to include a negative prompt, but results are usually better
	image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt, guidance_scale=1.0).to_tuple()
	```

	Now pass all the prompts and embeddings to the [`KandinskyPipeline`] to generate an image:

	```py
	image = pipeline(prompt, image_embeds=image_embeds, negative_prompt=negative_prompt, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]
	image
	```

	<div class="flex justify-center">
	<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/cheeseburger.png"/>
	</div>

	</hfoption>
	<hfoption id="Kandinsky 2.2">

	```py
	from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
	import torch

	prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16).to("cuda")
	pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16).to("cuda")

	prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
	negative_prompt = "low quality, bad quality" # optional to include a negative prompt, but results are usually better
	image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt, guidance_scale=1.0).to_tuple()
	```

	Pass the `image_embeds` and `negative_image_embeds` to the [`KandinskyV22Pipeline`] to generate an image:

	```py
	image = pipeline(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]
	image
	```

	<div class="flex justify-center">
	<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-text-to-image.png"/>
	</div>

	</hfoption>
	<hfoption id="Kandinsky 3">

	Kandinsky 3 doesn't require a prior model so you can directly load the [`Kandinsky3Pipeline`] and pass a prompt to generate an image:

	```py
	from diffusers import Kandinsky3Pipeline
	import torch

	pipeline = Kandinsky3Pipeline.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
	pipeline.enable_model_cpu_offload()

	prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
	image = pipeline(prompt).images[0]
	image
	```

	</hfoption>
	</hfoptions>
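
The difference between the Kandinsky 2.1 and 2.2 decoder calls above can be captured in a small helper. This is only a sketch: `build_decoder_kwargs` is a hypothetical function, not part of 🤗 Diffusers, and the embeddings are stand-in strings so the snippet runs without downloading a model.

```python
def build_decoder_kwargs(version, image_embeds, negative_image_embeds,
                         prompt=None, negative_prompt=None, **extra):
    """Collect keyword arguments for the decoder pipeline (hypothetical helper)."""
    kwargs = {
        "image_embeds": image_embeds,
        "negative_image_embeds": negative_image_embeds,
        **extra,
    }
    if version == "2.1":
        # Kandinsky 2.1 decodes from the text prompt and the image embeddings.
        if prompt is None:
            raise ValueError("Kandinsky 2.1 needs a prompt when decoding")
        kwargs["prompt"] = prompt
        if negative_prompt is not None:
            kwargs["negative_prompt"] = negative_prompt
    elif version != "2.2":
        raise ValueError(f"unsupported version: {version!r}")
    # For Kandinsky 2.2 the prompts are dropped: it decodes from embeddings only.
    return kwargs


# Stand-in values; real code would pass the tensors returned by the prior pipeline.
kwargs_21 = build_decoder_kwargs("2.1", "embeds", "neg_embeds", prompt="a cat", height=768, width=768)
kwargs_22 = build_decoder_kwargs("2.2", "embeds", "neg_embeds", prompt="a cat", height=768, width=768)
print(sorted(kwargs_21))
print(sorted(kwargs_22))
```

With real pipelines, the result would be unpacked into the call, for example `pipeline(**kwargs_22).images[0]`.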

	🤗 Diffusers also provides an end-to-end API with the [`KandinskyCombinedPipeline`] and [`KandinskyV22CombinedPipeline`], meaning you don't have to separately load the prior and text-to-image pipeline. The combined pipeline automatically loads both the prior model and the decoder. You can still set different values for the prior pipeline with the `prior_guidance_scale` and `prior_num_inference_steps` parameters if you want.

	Use the [`AutoPipelineForText2Image`] to automatically call the combined pipelines under the hood:

	<hfoptions id="text-to-image">
	<hfoption id="Kandinsky 2.1">

	```py
	from diffusers import AutoPipelineForText2Image
	import torch

	pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
	pipeline.enable_model_cpu_offload()

	prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
	negative_prompt = "low quality, bad quality"

	image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0]
	image
	```

	</hfoption>
	<hfoption id="Kandinsky 2.2">

	```py
	from diffusers import AutoPipelineForText2Image
	import torch

	pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
	pipeline.enable_model_cpu_offload()

	prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
	negative_prompt = "low quality, bad quality"

	image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0]
	image
	```

	</hfoption>
	</hfoptions>
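
The `guidance_scale` and `prior_guidance_scale` values above control classifier-free guidance. The sketch below shows the usual guidance formula on plain Python lists. It assumes the pipelines combine the unconditional and conditional predictions in this standard way, and `apply_guidance` is a hypothetical name used only for illustration.

```python
def apply_guidance(uncond, cond, guidance_scale):
    """Standard classifier-free guidance: start at the unconditional
    prediction and move toward the conditional one by `guidance_scale`."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0, 2.0]  # toy prediction for the negative (or empty) prompt
cond = [1.0, 1.0, 4.0]    # toy prediction for the actual prompt

same_as_cond = apply_guidance(uncond, cond, 1.0)  # a scale of 1.0 returns `cond`
pushed = apply_guidance(uncond, cond, 4.0)        # a scale of 4.0 overshoots past `cond`
print(same_as_cond, pushed)
```

Under this formula, a scale of 1.0 leaves the conditional prediction unchanged, while larger values push the result further away from the negative prompt.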

	## Image-to-image

	For image-to-image, pass the initial image and a text prompt to the pipeline to condition the generated image. Start by loading the prior pipeline:

	<hfoptions id="image-to-image">
	<hfoption id="Kandinsky 2.1">

	```py
	import torch
	from diffusers import KandinskyImg2ImgPipeline, KandinskyPriorPipeline

	prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
	pipeline = KandinskyImg2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
	```

	</hfoption>
	<hfoption id="Kandinsky 2.2">

	```py
	import torch
	from diffusers import KandinskyV22Img2ImgPipeline, KandinskyV22PriorPipeline

	prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
	pipeline = KandinskyV22Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
	```

	</hfoption>
	<hfoption id="Kandinsky 3">

	Kandinsky 3 doesn't require a prior model so you can directly load the image-to-image pipeline:

	```py
	from diffusers import Kandinsky3Img2ImgPipeline
	from diffusers.utils import load_image
	import torch

	pipeline = Kandinsky3Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
	pipeline.enable_model_cpu_offload()
	```

	</hfoption>
	</hfoptions>

	Download an image to condition on:

	```py
	from diffusers.utils import load_image

	# download image
	url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
	original_image = load_image(url)
	original_image = original_image.resize((768, 512))
	```

	<div class="flex justify-center">
	<img class="rounded-xl" src="https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"/>
	</div>

	Generate the `image_embeds` and `negative_image_embeds` with the prior pipeline:

	```py
	prompt = "A fantasy landscape, Cinematic lighting"
	negative_prompt = "low quality, bad quality"

	image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt).to_tuple()
	```

	Now pass the original image, and all the prompts and embeddings to the pipeline to generate an image:

	<hfoptions id="image-to-image">
	<hfoption id="Kandinsky 2.1">

	```py
	from diffusers.utils import make_image_grid

	image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0]
	make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
	```

	<div class="flex justify-center">
	<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/img2img_fantasyland.png"/>
	</div>

	</hfoption>
	<hfoption id="Kandinsky 2.2">

	```py
	from diffusers.utils import make_image_grid

	image = pipeline(image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0]
	make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
	```

	<div class="flex justify-center">
	<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-image-to-image.png"/>
	</div>

	</hfoption>
	<hfoption id="Kandinsky 3">

	```py
	image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, strength=0.75, num_inference_steps=25).images[0]
	image
	```

	</hfoption>
	</hfoptions>
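
The `strength` parameter controls how far the result may drift from the original image. Here is a rough sketch of the arithmetic, assuming the convention used by many 🤗 Diffusers image-to-image pipelines: the initial image is noised to an intermediate step and only the remaining steps are denoised. `steps_actually_run` is a hypothetical helper, not a library function.

```python
def steps_actually_run(num_inference_steps, strength):
    """Number of denoising steps executed for a strength in [0, 1]."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


low = steps_actually_run(100, 0.3)   # low strength: few steps, stays close to the input
high = steps_actually_run(25, 0.75)  # higher strength: most of the schedule is run
print(low, high)
```

Under this convention, a `strength` of 1.0 runs the full schedule and largely ignores the input image, while small values only lightly modify it.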

	🤗 Diffusers also provides an end-to-end API with the [`KandinskyImg2ImgCombinedPipeline`] and [`KandinskyV22Img2ImgCombinedPipeline`], meaning you don't have to separately load the prior and image-to-image pipeline. The combined pipeline automatically loads both the prior model and the decoder. You can still set different values for the prior pipeline with the `prior_guidance_scale` and `prior_num_inference_steps` parameters if you want.

	Use the [`AutoPipelineForImage2Image`] to automatically call the combined pipelines under the hood:

	<hfoptions id="image-to-image">
	<hfoption id="Kandinsky 2.1">

	```py
	from diffusers import AutoPipelineForImage2Image
	from diffusers.utils import make_image_grid, load_image
	import torch

	pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True)
	pipeline.enable_model_cpu_offload()

	prompt = "A fantasy landscape, Cinematic lighting"
	negative_prompt = "low quality, bad quality"

	url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
	original_image = load_image(url)

	original_image.thumbnail((768, 768))

	image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0]
	make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
	```

	</hfoption>
	<hfoption id="Kandinsky 2.2">

	```py
	from diffusers import AutoPipelineForImage2Image
	from diffusers.utils import make_image_grid, load_image
	import torch

	pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
	pipeline.enable_model_cpu_offload()

	prompt = "A fantasy landscape, Cinematic lighting"
	negative_prompt = "low quality, bad quality"

	url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
	original_image = load_image(url)

	original_image.thumbnail((768, 768))

	image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0]
	make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
	```

	</hfoption>
	</hfoptions>

	## Inpainting

	<Tip warning={true}>

	⚠️ The Kandinsky models use ⬜️ white pixels to represent the masked area now instead of black pixels. If you are using [`KandinskyInpaintPipeline`] in production, you need to change the mask to use white pixels:

	```py
	# For PIL input
	import PIL.ImageOps
	mask = PIL.ImageOps.invert(mask)

	# For PyTorch and NumPy input
	mask = 1 - mask
	```

	</Tip>
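
Note that `1 - mask` assumes a float mask with values in `[0, 1]`. Below is a minimal sketch of a helper that also handles 8-bit masks with values in `[0, 255]`. `invert_mask` is a hypothetical name, not a 🤗 Diffusers function.

```python
import numpy as np


def invert_mask(mask):
    """Swap masked and unmasked areas for float [0, 1] or uint8 [0, 255] masks."""
    mask = np.asarray(mask)
    if mask.dtype == np.uint8:
        return 255 - mask  # 8-bit masks: white is 255
    return 1.0 - mask      # float masks: white is 1.0


# Old convention: black (0) marks the area to replace.
old_float = np.array([[0.0, 1.0], [1.0, 1.0]], dtype=np.float32)
old_uint8 = np.array([[0, 255], [255, 255]], dtype=np.uint8)

# New convention: white marks the area to replace.
new_float = invert_mask(old_float)
new_uint8 = invert_mask(old_uint8)
print(new_float)
print(new_uint8)
```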

	For inpainting, you'll need the original image, a mask of the area to replace in the original image, and a text prompt of what to inpaint. Load the prior pipeline:

	<hfoptions id="inpaint">
	<hfoption id="Kandinsky 2.1">

	```py
	from diffusers import KandinskyInpaintPipeline, KandinskyPriorPipeline
	from diffusers.utils import load_image, make_image_grid
	import torch
	import numpy as np
	from PIL import Image

	prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
	pipeline = KandinskyInpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
	```

	</hfoption>
	<hfoption id="Kandinsky 2.2">

	```py
	from diffusers import KandinskyV22InpaintPipeline, KandinskyV22PriorPipeline
	from diffusers.utils import load_image, make_image_grid
	import torch
	import numpy as np
	from PIL import Image

	prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
	pipeline = KandinskyV22InpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
	```

	</hfoption>
	</hfoptions>

	Load an initial image and create a mask:

	```py
	init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
	mask = np.zeros((768, 768), dtype=np.float32)
	# mask area above cat's head
	mask[:250, 250:-250] = 1
	```
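
The slicing above marks a rectangle at the top center of the image. A small sketch of a reusable helper follows. `rectangle_mask` and its parameters are hypothetical and only illustrate how the white (masked) region is laid out.

```python
import numpy as np


def rectangle_mask(height, width, top, bottom, left, right):
    """Return a float32 mask where 1 marks the rectangle to inpaint."""
    mask = np.zeros((height, width), dtype=np.float32)
    mask[top:bottom, left:right] = 1
    return mask


# Same region as `mask[:250, 250:-250] = 1` on a 768x768 image.
mask = rectangle_mask(768, 768, top=0, bottom=250, left=250, right=768 - 250)
coverage = float(mask.mean())  # fraction of pixels that will be repainted
print(mask.shape, round(coverage, 3))
```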

	Generate the embeddings with the prior pipeline:

	```py
	prompt = "a hat"
	prior_output = prior_pipeline(prompt)
	```

	Now pass the initial image, mask, prompt, and embeddings to the pipeline to generate an image:

	<hfoptions id="inpaint">
	<hfoption id="Kandinsky 2.1">

	```py
	output_image = pipeline(prompt, image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0]
	mask = Image.fromarray((mask*255).astype('uint8'), 'L')
	make_image_grid([init_image, mask, output_image], rows=1, cols=3)
	```

	<div class="flex justify-center">
	<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/inpaint_cat_hat.png"/>
	</div>

	</hfoption>
	<hfoption id="Kandinsky 2.2">

	```py
	output_image = pipeline(image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0]
	mask = Image.fromarray((mask*255).astype('uint8'), 'L')
	make_image_grid([init_image, mask, output_image], rows=1, cols=3)
	```

	<div class="flex justify-center">
	<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinskyv22-inpaint.png"/>
	</div>

	</hfoption>
	</hfoptions>

	You can also use the end-to-end [`KandinskyInpaintCombinedPipeline`] and [`KandinskyV22InpaintCombinedPipeline`] to call the prior and decoder pipelines together under the hood. Use the [`AutoPipelineForInpainting`] for this:

	<hfoptions id="inpaint">
	<hfoption id="Kandinsky 2.1">

	```py
	import torch
	import numpy as np
	from PIL import Image
	from diffusers import AutoPipelineForInpainting
	from diffusers.utils import load_image, make_image_grid

	pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16)
	pipe.enable_model_cpu_offload()

	init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
	mask = np.zeros((768, 768), dtype=np.float32)
	# mask area above cat's head
	mask[:250, 250:-250] = 1
	prompt = "a hat"

	output_image = pipe(prompt=prompt, image=init_image, mask_image=mask).images[0]
	mask = Image.fromarray((mask*255).astype('uint8'), 'L')
	make_image_grid([init_image, mask, output_image], rows=1, cols=3)
	```

	</hfoption>
	<hfoption id="Kandinsky 2.2">

	```py
	import torch
	import numpy as np
	from PIL import Image
	from diffusers import AutoPipelineForInpainting
	from diffusers.utils import load_image, make_image_grid

	pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16)
	pipe.enable_model_cpu_offload()

	init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
	mask = np.zeros((768, 768), dtype=np.float32)
	# mask area above cat's head
	mask[:250, 250:-250] = 1
	prompt = "a hat"

	output_image = pipe(prompt=prompt, image=init_image, mask_image=mask).images[0]
	mask = Image.fromarray((mask*255).astype('uint8'), 'L')
	make_image_grid([init_image, mask, output_image], rows=1, cols=3)
	```

	</hfoption>
	</hfoptions>

	## Interpolation

	Interpolation allows you to explore the latent space between the image and text embeddings, which is a cool way to see some of the prior model's intermediate outputs. Load the prior pipeline and two images you'd like to interpolate:

	<hfoptions id="interpolate">
	<hfoption id="Kandinsky 2.1">

	```py
	from diffusers import KandinskyPriorPipeline, KandinskyPipeline
	from diffusers.utils import load_image, make_image_grid
	import torch

	prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
	img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
	img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg")
	make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2)
	```

	</hfoption>
	<hfoption id="Kandinsky 2.2">

	```py
	from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
	from diffusers.utils import load_image, make_image_grid
	import torch

	prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
	img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
	img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg")
	make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2)
	```

	</hfoption>
	</hfoptions>

	<div class="flex gap-4">
	<div>
	<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png"/>
	<figcaption class="mt-2 text-center text-sm text-gray-500">a cat</figcaption>
	</div>
	<div>
	<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg"/>
	<figcaption class="mt-2 text-center text-sm text-gray-500">Van Gogh's Starry Night painting</figcaption>
	</div>
	</div>

	Specify the text or images to interpolate, and set the weights for each text or image. Experiment with the weights to see how they affect the interpolation!

	```py
	images_texts = ["a cat", img_1, img_2]
	weights = [0.3, 0.3, 0.4]
	```
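
Conceptually, interpolation blends the embedding of each text or image according to its weight. The sketch below uses toy NumPy vectors in place of real embeddings and assumes a plain weighted sum. `blend_embeddings` is a hypothetical helper, not the library's implementation.

```python
import numpy as np


def blend_embeddings(embeddings, weights):
    """Weighted sum of equally shaped embedding vectors."""
    if len(embeddings) != len(weights):
        raise ValueError("need exactly one weight per embedding")
    stacked = np.stack([np.asarray(e, dtype=np.float32) for e in embeddings])
    w = np.asarray(weights, dtype=np.float32)[:, None]  # one weight per row
    return (stacked * w).sum(axis=0)


# Toy 2-dimensional "embeddings" standing in for the text and the two images.
text_emb, cat_emb, painting_emb = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
blended = blend_embeddings([text_emb, cat_emb, painting_emb], [0.3, 0.3, 0.4])
print(blended)
```

Raising one weight pulls the blended vector toward that input, which is the effect you should see in the generated image when you experiment with the weights.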

	Call the `interpolate` function to generate the embeddings, and then pass them to the pipeline to generate the image:

	<hfoptions id="interpolate">
	<hfoption id="Kandinsky 2.1">

	```py
	# prompt can be left empty
	prompt = ""
	prior_out = prior_pipeline.interpolate(images_texts, weights)

	pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda")

	image = pipeline(prompt, **prior_out, height=768, width=768).images[0]
	image
	```

	<div class="flex justify-center">
	<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/starry_cat.png"/>
	</div>

	</hfoption>
	<hfoption id="Kandinsky 2.2">

	```py
	prior_out = prior_pipeline.interpolate(images_texts, weights)

	pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda")

	# Kandinsky 2.2 decodes from the image embeddings only, so no prompt is passed
	image = pipeline(**prior_out, height=768, width=768).images[0]
	image
	```

	<div class="flex justify-center">
	<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinskyv22-interpolate.png"/>
	</div>

	</hfoption>
	</hfoptions>

	## ControlNet

	<Tip warning={true}>

	⚠️ ControlNet is only supported for Kandinsky 2.2!

	</Tip>

	ControlNet enables conditioning large pretrained diffusion models with additional inputs such as a depth map or edge detection. For example, you can condition Kandinsky 2.2 with a depth map so the model understands and preserves the structure of the depth image.

	Let's load an image and extract its depth map:

	```py
	from diffusers.utils import load_image

	img = load_image(
	    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
	).resize((768, 768))
	img
	```

	<div class="flex justify-center">
	<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"/>
	</div>

	Then you can use the `depth-estimation` [`~transformers.Pipeline`] from 🤗 Transformers to process the image and retrieve the depth map:

	```py
	import torch
	import numpy as np

	from transformers import pipeline

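	def make_hint(image, depth_estimator):
	    # run the depth estimator and keep the predicted depth image
	    image = depth_estimator(image)["depth"]
	    image = np.array(image)
	    # repeat the single depth channel three times: (H, W) -> (H, W, 3)
	    image = image[:, :, None]
	    image = np.concatenate([image, image, image], axis=2)
	    # scale to [0, 1] and move channels first: (H, W, 3) -> (3, H, W)
	    detected_map = torch.from_numpy(image).float() / 255.0
	    hint = detected_map.permute(2, 0, 1)
	    return hint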

	depth_estimator = pipeline("depth-estimation")
	hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
	```
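
To see what `make_hint` produces without running a depth model, the sketch below applies the same array steps to a fake depth map using NumPy only. The random array stands in for the estimator output, and `fake_hint` is a hypothetical name.

```python
import numpy as np


def fake_hint(depth):
    """Mirror the array steps of `make_hint` for an (H, W) uint8 depth map."""
    image = depth[:, :, None]                              # (H, W, 1)
    image = np.concatenate([image, image, image], axis=2)  # (H, W, 3)
    scaled = image.astype(np.float32) / 255.0              # values in [0, 1]
    chw = scaled.transpose(2, 0, 1)                        # (3, H, W)
    return chw[None]                                       # (1, 3, H, W) batch


# Random values stand in for a real depth prediction.
depth = np.random.default_rng(0).integers(0, 256, size=(768, 768), dtype=np.uint8)
hint_array = fake_hint(depth)
print(hint_array.shape, hint_array.dtype)
```

The result has the same layout as the `hint` tensor above after `unsqueeze(0)`: one batch entry with three identical depth channels.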

	### Text-to-image [[controlnet-text-to-image]]

	Load the prior pipeline and the [`KandinskyV22ControlnetPipeline`]:

	```py
	from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline

	prior_pipeline = KandinskyV22PriorPipeline.from_pretrained(
	    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True
	).to("cuda")

	pipeline = KandinskyV22ControlnetPipeline.from_pretrained(
	    "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
	).to("cuda")
	```

	Generate the image embeddings from a prompt and negative prompt:

	```py
	prompt = "A robot, 4k photo"
	negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"

	generator = torch.Generator(device="cuda").manual_seed(43)

	image_emb, zero_image_emb = prior_pipeline(
	    prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator
	).to_tuple()
	```

	Finally, pass the image embeddings and the depth image to the [`KandinskyV22ControlnetPipeline`] to generate an image:

	```py
	image = pipeline(image_embeds=image_emb, negative_image_embeds=zero_image_emb, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0]
	image
	```

	<div class="flex justify-center">
	<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/robot_cat_text2img.png"/>
	</div>

	### Image-to-image [[controlnet-image-to-image]]

	For image-to-image with ControlNet, you'll need to use the:

	- [`KandinskyV22PriorEmb2EmbPipeline`] to generate the image embeddings from a text prompt and an image
	- [`KandinskyV22ControlnetImg2ImgPipeline`] to generate an image from the initial image and the image embeddings

	Process and extract a depth map of an initial image of a cat with the `depth-estimation` [`~transformers.Pipeline`] from 🤗 Transformers:

	```py
	import torch
	import numpy as np

	from diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22ControlnetImg2ImgPipeline
	from diffusers.utils import load_image, make_image_grid
	from transformers import pipeline

	img = load_image(
	    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
	).resize((768, 768))

	def make_hint(image, depth_estimator):
	    # run the depth estimator and keep the predicted depth image
	    image = depth_estimator(image)["depth"]
	    image = np.array(image)
	    # repeat the single depth channel three times: (H, W) -> (H, W, 3)
	    image = image[:, :, None]
	    image = np.concatenate([image, image, image], axis=2)
	    # scale to [0, 1] and move channels first: (H, W, 3) -> (3, H, W)
	    detected_map = torch.from_numpy(image).float() / 255.0
	    hint = detected_map.permute(2, 0, 1)
	    return hint

	depth_estimator = pipeline("depth-estimation")
	hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
	```

	Load the prior pipeline and the [`KandinskyV22ControlnetImg2ImgPipeline`]:

	```py
	prior_pipeline = KandinskyV22PriorEmb2EmbPipeline.from_pretrained(
	    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True
	).to("cuda")

	pipeline = KandinskyV22ControlnetImg2ImgPipeline.from_pretrained(
	    "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
	).to("cuda")
	```

	Pass a text prompt and the initial image to the prior pipeline to generate the image embeddings:

	```py
	prompt = "A robot, 4k photo"
	negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"

	generator = torch.Generator(device="cuda").manual_seed(43)

	img_emb = prior_pipeline(prompt=prompt, image=img, strength=0.85, generator=generator)
	negative_emb = prior_pipeline(prompt=negative_prior_prompt, image=img, strength=1, generator=generator)
	```

	Now you can run the [`KandinskyV22ControlnetImg2ImgPipeline`] to generate an image from the initial image and the image embeddings:

	```py
	image = pipeline(image=img, strength=0.5, image_embeds=img_emb.image_embeds, negative_image_embeds=negative_emb.image_embeds, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0]
	make_image_grid([img.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
	```

	<div class="flex justify-center">
	<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/robot_cat.png"/>
	</div>

	## Optimizations

	Kandinsky is unique because it requires a prior pipeline to generate the mappings, and a second pipeline to decode the latents into an image. Optimization efforts should be focused on the second pipeline because that is where the bulk of the computation is done. Here are some tips to speed up Kandinsky and reduce its memory usage during inference.

	1. Enable [xFormers](../optimization/xformers) if you're using PyTorch < 2.0:

	```diff
	  from diffusers import DiffusionPipeline
	  import torch

	  pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
	+ pipe.enable_xformers_memory_efficient_attention()
	```

	2. Enable `torch.compile` if you're using PyTorch >= 2.0 to automatically use scaled dot-product attention (SDPA):

	```diff
	  pipe.unet.to(memory_format=torch.channels_last)
	+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
	```

	This is the same as explicitly setting the attention processor to use [`~models.attention_processor.AttnAddedKVProcessor2_0`]:

	```py
	from diffusers.models.attention_processor import AttnAddedKVProcessor2_0

	pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0())
	```

	3. Offload the model to the CPU with [`~KandinskyPriorPipeline.enable_model_cpu_offload`] to avoid out-of-memory errors:

	```diff
	  from diffusers import DiffusionPipeline
	  import torch

	  pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
	+ pipe.enable_model_cpu_offload()
	```

	4. By default, the text-to-image pipeline uses the [`DDIMScheduler`] but you can replace it with another scheduler like [`DDPMScheduler`] to see how that affects the tradeoff between inference speed and image quality:

	```py
	import torch
	from diffusers import DDPMScheduler, DiffusionPipeline

	scheduler = DDPMScheduler.from_pretrained("kandinsky-community/kandinsky-2-1", subfolder="ddpm_scheduler")
	pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", scheduler=scheduler, torch_dtype=torch.float16, use_safetensors=True).to("cuda")
	```
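
Which of the first two tips applies depends on the installed PyTorch version. A minimal sketch of that decision follows. `pick_attention_backend` is a hypothetical helper, and the version strings are examples rather than tested configurations.

```python
def pick_attention_backend(torch_version):
    """Return "sdpa" for PyTorch >= 2.0 and "xformers" for older versions."""
    # Keep only the leading major number, e.g. "2.1.0+cu118" -> 2.
    major = int(torch_version.split(".")[0])
    return "sdpa" if major >= 2 else "xformers"


for version in ["1.13.1", "2.0.0", "2.1.0+cu118"]:
    print(version, "->", pick_attention_backend(version))
```

In a real script you would pass `torch.__version__` and call `pipe.enable_xformers_memory_efficient_attention()` only in the xFormers case.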