Initial inference time: 30-40 sec 😁

1) Lowered num_steps for the diffusion model from 20 to 10 -> inference time = 17-19 sec 👍
2) Moved the ONNX model from CPU compute to GPU -> inference time = 12-14 sec (cold starts take more time) 😀

How it works:

1. Preprocess images. The target human image is preprocessed with OpenPose and human parsing first:
   - openpose: extracts pose information (joint keypoints)
   - humanparse: segments the image into different parts (face, body, background), which we use to build the mask that decides where diffusion happens

   The mask from humanparse is merged onto the original human image, and that masked image is what we feed into the diffusion model (see the sketch after this list).

   Processing the cloth image:

   ```python
   with torch.no_grad():
       # CLIP image embedding of the garment image
       prompt_image = self.auto_processor(images=image_garm, return_tensors="pt").to('cuda')
       prompt_image = self.image_encoder(prompt_image.data['pixel_values']).image_embeds
       prompt_image = prompt_image.unsqueeze(1)
       if model_type == 'hd':
           # hd model: overwrite every text token after the first with the image embedding
           prompt_embeds = self.text_encoder(self.tokenize_captions([""], 2).to('cuda'))[0]
           prompt_embeds[:, 1:] = prompt_image[:]
       elif model_type == 'dc':
           # dc model: append the image embedding to the category prompt embedding
           prompt_embeds = self.text_encoder(self.tokenize_captions([category], 3).to('cuda'))[0]
           prompt_embeds = torch.cat([prompt_embeds, prompt_image], dim=1)
   ```

   This converts the cloth image into an image embedding and generates a prompt embedding from the category we provide.

   GatedSelfAttentionDense: this class combines visual features and object features using self-attention. It's likely used to fuse information about the clothing item with the human body image.

2. Finally, we feed the masked human image together with the concatenated embeddings ([image_embeds, prompt_embeds]) into the diffusion model. At inference time the diffusion model first encodes the image input into a latent embedding using the VAE, then performs diffusion with the parameters we provide (samples, num_steps, noise, seed, etc.). After num_steps of denoising, the output latent is decoded back into image space using the VAE, and that is our output image.
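A rough sketch of the mask-merge step, in case it helps: the idea is to neutralize the region picked out by the human-parse segmentation so the diffusion model knows where to inpaint. The label ids, the helper name, and the gray fill value here are assumptions for illustration, not the pipeline's actual values.

```python
import numpy as np
from PIL import Image

def build_masked_input(human_img: Image.Image, parse_map: np.ndarray,
                       garment_labels=frozenset({5, 6, 7})) -> Image.Image:
    """parse_map: (H, W) integer array of human-parse labels.
    garment_labels: hypothetical label ids covering the region to re-dress."""
    img = np.array(human_img)
    mask = np.isin(parse_map, list(garment_labels))  # True where diffusion should act
    img[mask] = 127                                  # fill masked region with neutral gray
    return Image.fromarray(img)
```

And a minimal sketch of the encode -> denoise -> decode loop from step 2, written with generic diffusers components. The checkpoint name, the DDIM scheduler, and the shape of prompt_embeds are assumptions; the real try-on model also wires the masked image and garment features into the UNet as extra conditioning, which is omitted here.

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler

device = "cuda"
model_id = "runwayml/stable-diffusion-v1-5"  # placeholder checkpoint, not the try-on model

vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")

num_steps = 10                                               # the lowered step count from above
generator = torch.Generator(device=device).manual_seed(42)   # the "seed" parameter

@torch.no_grad()
def run_diffusion(image: torch.Tensor, prompt_embeds: torch.Tensor) -> torch.Tensor:
    """image: (1, 3, H, W) in [-1, 1]; prompt_embeds: (1, seq_len, 768)."""
    # 1) VAE encode: image space -> latent space (scaled per SD convention)
    latents = vae.encode(image).latent_dist.sample(generator)
    latents = latents * vae.config.scaling_factor

    # 2) Noise the latents, then run the denoising loop for num_steps
    scheduler.set_timesteps(num_steps, device=device)
    noise = torch.randn(latents.shape, generator=generator, device=device)
    latents = scheduler.add_noise(latents, noise, scheduler.timesteps[:1])
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, encoder_hidden_states=prompt_embeds).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # 3) VAE decode: latent space -> image space
    return vae.decode(latents / vae.config.scaling_factor).sample
```

Lowering num_steps from 20 to 10 shortens the `for t in scheduler.timesteps` loop above, which is why it roughly halved inference time.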