Initial inference time: 30-40 sec 😁

1) Lowered num_steps for the diffusion model from 20 to 10 -> inference time = 17-19 sec 👍
2) Moved the ONNX model from CPU compute to GPU -> inference time = 12-14 sec (cold starts take more time) 😀

How it works:

1. Preprocess images. The target human image is preprocessed with OpenPose and human parsing first:
   - openpose: extracts pose information (joint keypoints)
   - humanparse: segments the image into different parts (face, body, background), which we use to build the mask that decides where diffusion happens

   The mask from humanparse is merged onto the original human image, and that masked image is what we feed into the diffusion model (see the sketch after this list).

   Processing the cloth image:

   ```python
   with torch.no_grad():
       # CLIP image embedding of the garment image
       prompt_image = self.auto_processor(images=image_garm, return_tensors="pt").to('cuda')
       prompt_image = self.image_encoder(prompt_image.data['pixel_values']).image_embeds
       prompt_image = prompt_image.unsqueeze(1)
       if model_type == 'hd':
           # hd model: overwrite every text token after the first with the image embedding
           prompt_embeds = self.text_encoder(self.tokenize_captions([""], 2).to('cuda'))[0]
           prompt_embeds[:, 1:] = prompt_image[:]
       elif model_type == 'dc':
           # dc model: append the image embedding to the category prompt embedding
           prompt_embeds = self.text_encoder(self.tokenize_captions([category], 3).to('cuda'))[0]
           prompt_embeds = torch.cat([prompt_embeds, prompt_image], dim=1)
   ```

   This converts the cloth image into an image embedding and generates a prompt embedding from the category we provide.

   GatedSelfAttentionDense: this class combines visual features and object features using self-attention. It's likely used to fuse information about the clothing item with the human body image.

2. Finally, we feed the masked human image together with the concatenated embeddings ([image_embeds, prompt_embeds]) into the diffusion model. At inference time the diffusion model first encodes the image input into a latent embedding using the VAE, then performs diffusion with the parameters we provide (samples, num_steps, noise, seed, etc.). After num_steps of denoising, the output latent is decoded back into image space using the VAE, and that is our output image.
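A rough sketch of the mask-merge step, in case it helps: the idea is to neutralize the region picked out by the human-parse segmentation so the diffusion model knows where to inpaint. The label ids, the helper name, and the gray fill value here are assumptions for illustration, not the pipeline's actual values.

```python
import numpy as np
from PIL import Image

def build_masked_input(human_img: Image.Image, parse_map: np.ndarray,
                       garment_labels=frozenset({5, 6, 7})) -> Image.Image:
    """parse_map: (H, W) integer array of human-parse labels.
    garment_labels: hypothetical label ids covering the region to re-dress."""
    img = np.array(human_img)
    mask = np.isin(parse_map, list(garment_labels))  # True where diffusion should act
    img[mask] = 127                                  # fill masked region with neutral gray
    return Image.fromarray(img)
```

And a minimal sketch of the encode -> denoise -> decode loop from step 2, written with generic diffusers components. The checkpoint name, the DDIM scheduler, and the shape of prompt_embeds are assumptions; the real try-on model also wires the masked image and garment features into the UNet as extra conditioning, which is omitted here.

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler

device = "cuda"
model_id = "runwayml/stable-diffusion-v1-5"  # placeholder checkpoint, not the try-on model

vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")

num_steps = 10                                               # the lowered step count from above
generator = torch.Generator(device=device).manual_seed(42)   # the "seed" parameter

@torch.no_grad()
def run_diffusion(image: torch.Tensor, prompt_embeds: torch.Tensor) -> torch.Tensor:
    """image: (1, 3, H, W) in [-1, 1]; prompt_embeds: (1, seq_len, 768)."""
    # 1) VAE encode: image space -> latent space (scaled per SD convention)
    latents = vae.encode(image).latent_dist.sample(generator)
    latents = latents * vae.config.scaling_factor

    # 2) Noise the latents, then run the denoising loop for num_steps
    scheduler.set_timesteps(num_steps, device=device)
    noise = torch.randn(latents.shape, generator=generator, device=device)
    latents = scheduler.add_noise(latents, noise, scheduler.timesteps[:1])
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, encoder_hidden_states=prompt_embeds).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # 3) VAE decode: latent space -> image space
    return vae.decode(latents / vae.config.scaling_factor).sample
```

Lowering num_steps from 20 to 10 shortens the `for t in scheduler.timesteps` loop above, which is why it roughly halved inference time.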