Spaces:
Runtime error
initial inference time - 30-40 sec
1) lowered num_steps for the diffusion model from 20 to 10 - inference time = 17-19 sec
2) moved the ONNX model from CPU compute to GPU - inference time = 12-14 sec (cold starts take more time; see the sketch below)
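As a rough illustration of step 2, this is how an ONNX model can be run on the GPU with onnxruntime-gpu; the model path and names here are placeholders, not the actual files used in the Space.

import onnxruntime as ort
import numpy as np

# Sketch only: "humanparse.onnx" is a placeholder model path.
session = ort.InferenceSession(
    "humanparse.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],  # CUDA first, CPU as fallback
)

def run_model(input_array: np.ndarray):
    input_name = session.get_inputs()[0].name
    # session.run returns a list of output arrays
    return session.run(None, {input_name: input_array})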
Working:
1) Preprocess images -
first, the target human image is preprocessed with OpenPose and human parsing:
OpenPose - extracts pose information for the body joints
human parsing - segments the image into different parts (face, body, background) so we can
determine where to apply diffusion using a mask
the mask from human parsing is merged onto the original human image, and that is what we feed into the diffusion model (a mask-merging sketch follows)
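A minimal sketch of the mask-merging idea, assuming the human-parse output has already been turned into a binary mask of the region to repaint (function and variable names are illustrative, not the repo's actual API):

import numpy as np
from PIL import Image

def apply_mask(human_img: Image.Image, mask: Image.Image) -> Image.Image:
    """Gray out the masked region of the person image - the area the diffusion model will repaint."""
    img = np.array(human_img).astype(np.float32)
    m = (np.array(mask.convert("L")) > 127)[..., None]  # HxWx1 boolean mask
    img = np.where(m, 127.5, img)                       # neutral gray where diffusion happens
    return Image.fromarray(img.astype(np.uint8))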
Processing the cloth image -
with torch.no_grad():
    # encode the garment image with the image encoder to get an image embedding
    prompt_image = self.auto_processor(images=image_garm, return_tensors="pt").to('cuda')
    prompt_image = self.image_encoder(prompt_image.data['pixel_values']).image_embeds
    prompt_image = prompt_image.unsqueeze(1)
    if model_type == 'hd':
        # 'hd' model: overwrite the text token embeddings (after the first token) with the garment image embedding
        prompt_embeds = self.text_encoder(self.tokenize_captions([""], 2).to('cuda'))[0]
        prompt_embeds[:, 1:] = prompt_image[:]
    elif model_type == 'dc':
        # 'dc' model: encode the category text and concatenate the garment image embedding to it
        prompt_embeds = self.text_encoder(self.tokenize_captions([category], 3).to('cuda'))[0]
        prompt_embeds = torch.cat([prompt_embeds, prompt_image], dim=1)
This converts the cloth image into an image embedding and generates a prompt embedding from the category we provide.
GatedSelfAttentionDense: this class combines visual features and object features using self-attention.
It is likely used to fuse information about the clothing item with the human body image (a simplified sketch follows).
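A simplified sketch of what a gated self-attention fusion block can look like; the actual GatedSelfAttentionDense in the repo may differ in details such as projections and feed-forward layers:

import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """Fuses visual tokens with object (garment) tokens via self-attention, scaled by a learned gate."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts at 0, so the block is a no-op at init

    def forward(self, visual: torch.Tensor, objs: torch.Tensor) -> torch.Tensor:
        # visual: (B, N_vis, dim), objs: (B, N_obj, dim)
        n_vis = visual.shape[1]
        x = self.norm(torch.cat([visual, objs], dim=1))  # self-attend over the joint sequence
        out, _ = self.attn(x, x, x)
        # add only the visual-token part back, scaled by the learned gate
        return visual + torch.tanh(self.gate) * out[:, :n_vis]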
2) At last we feed both the masked human image and
the concatenated cloth-image embedding and prompt embedding - [image_embeds, prompt_embeds] -
into the diffusion model and run inference:
first it converts the image input into a latent embedding using the VAE,
then performs diffusion with the parameters we provided (samples, num_steps, noise, seed, etc.),
and after that number of diffusion steps it converts the output back into image space using the VAE -
and that's our output image (an illustrative denoising loop follows).
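For reference, an illustrative latent-diffusion denoising loop using diffusers-style components (vae, unet, scheduler are assumed to be loaded already); the real try-on pipeline adds masking and garment conditioning on top of this, so treat it as a sketch, not the Space's actual code:

import torch

@torch.no_grad()
def run_diffusion(vae, unet, scheduler, image, prompt_embeds, num_steps=10, seed=0):
    # image: (B, 3, H, W) tensor normalized to [-1, 1]
    generator = torch.Generator(device=image.device).manual_seed(seed)

    # 1) encode the (masked) person image into VAE latent space
    latents = vae.encode(image).latent_dist.sample(generator) * vae.config.scaling_factor

    # 2) add noise and iteratively denoise with the UNet
    scheduler.set_timesteps(num_steps)
    noise = torch.randn(latents.shape, generator=generator, device=latents.device, dtype=latents.dtype)
    latents = scheduler.add_noise(latents, noise, scheduler.timesteps[:1])
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, encoder_hidden_states=prompt_embeds).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # 3) decode the denoised latents back into image space
    return vae.decode(latents / vae.config.scaling_factor).sample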