GT_VTR3_1 / debugging_setps.txt
Ubuntu
improved inference time
3bc69b8
initial inference time - 30-40 sec 😁
1) lowered num_steps for the diffusion model from 20 to 10 - inference time = 17-19 sec 👍
2) moved the ONNX model from CPU compute to GPU - inference time = 12-14 sec, cold start takes more time 😀 (see the sketch below)
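a minimal sketch of how the ONNX session can be pointed at the GPU, assuming onnxruntime-gpu is installed; the model path here is a hypothetical placeholder, not the repo's actual file -

import onnxruntime as ort

# hypothetical path - use whichever ONNX model (humanparse / openpose) the pipeline loads
model_path = "checkpoints/example_model.onnx"

session = ort.InferenceSession(
    model_path,
    # prefer the CUDA execution provider, fall back to CPU if CUDA is unavailable
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# the first run is slower (cold start) because CUDA kernels get initialized then
print(session.get_providers())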
working
1) preprocess images -
first the target human image is preprocessed with openpose and humanparse
openpose - to get pose information for the joints
humanparse - to segment the image into different parts like face, body, background, which we use to
determine where to apply diffusion via a mask
the mask from humanparse is merged onto the original human image, and that masked image is what we feed into the diffusion model (see the sketch below)
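a minimal sketch of the mask-merging step, assuming the humanparse output is an HxW label map; the label ids and the gray fill value below are illustrative assumptions -

import numpy as np
from PIL import Image

def build_masked_human(human_img, parse_map, garment_labels=(4, 7)):
    # human_img: PIL image of the person
    # parse_map: HxW numpy array of humanparse label ids
    # garment_labels: assumed ids for the region to repaint (real ids depend on the parser)
    human = np.array(human_img)
    # boolean mask of pixels belonging to the garment region
    mask = np.isin(parse_map, garment_labels)
    # fill the masked region with neutral gray so the diffusion model treats it as unknown
    human[mask] = 128
    return Image.fromarray(human)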
processing cloth image -
with torch.no_grad():
    # encode the garment image with the CLIP image encoder to get image embeddings
    prompt_image = self.auto_processor(images=image_garm, return_tensors="pt").to('cuda')
    prompt_image = self.image_encoder(prompt_image.data['pixel_values']).image_embeds
    prompt_image = prompt_image.unsqueeze(1)
    if model_type == 'hd':
        # 'hd' model: encode an empty caption, then overwrite the tokens after position 0 with the image embedding
        prompt_embeds = self.text_encoder(self.tokenize_captions([""], 2).to('cuda'))[0]
        prompt_embeds[:, 1:] = prompt_image[:]
    elif model_type == 'dc':
        # 'dc' model: encode the garment category as text, then append the image embedding
        prompt_embeds = self.text_encoder(self.tokenize_captions([category], 3).to('cuda'))[0]
        prompt_embeds = torch.cat([prompt_embeds, prompt_image], dim=1)
this converts the cloth image into an image embedding and builds the prompt embedding from the category we provide
GatedSelfAttentionDense: This class combines visual features and object features using self-attention.
It's likely used to fuse information about the clothing items with the human body image.
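a simplified sketch of what a gated self-attention fusion block like this typically looks like; the class name, dimensions, and gating details below are illustrative assumptions, not the repo's exact GatedSelfAttentionDense -

import torch
import torch.nn as nn

class GatedSelfAttentionSketch(nn.Module):
    def __init__(self, query_dim, context_dim, n_heads=8):
        super().__init__()
        self.linear = nn.Linear(context_dim, query_dim)  # project object/garment features to the visual dim
        self.attn = nn.MultiheadAttention(query_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(query_dim)
        # learnable gate, initialized at zero so the fusion starts as a no-op
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x, objs):
        # x:    (B, N_visual, query_dim) visual tokens from the human image stream
        # objs: (B, N_obj, context_dim)  garment/object features
        n_visual = x.shape[1]
        objs = self.linear(objs)
        # joint self-attention over visual tokens and object tokens
        tokens = self.norm(torch.cat([x, objs], dim=1))
        attn_out, _ = self.attn(tokens, tokens, tokens)
        # keep only the visual positions and add them back through the gate
        return x + torch.tanh(self.alpha) * attn_out[:, :n_visual]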
2) at last we feed both the masked human image and
the concatenated cloth image embedding and prompt embedding - [image_embeds, prompt_embeds] -
into the diffusion model and run inference -
first it converts the image input into a latent embedding using the VAE,
then it performs diffusion with the parameters we provided - samples, num_steps, noise, seed, etc.
after that number of diffusion steps the output is converted back into image space using the VAE,
and that's our output image (see the sketch below)
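a minimal sketch of that encode -> denoise -> decode loop using generic diffusers-style components; the variable names and conditioning wiring are simplified assumptions, not the repo's actual pipeline code -

import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler

# assumed, already-loaded components (the real pipeline constructs and wires these itself)
vae: AutoencoderKL = ...
unet: UNet2DConditionModel = ...
scheduler: DDIMScheduler = ...

@torch.no_grad()
def run_diffusion(masked_image, prompt_embeds, num_steps=10, seed=0):
    # masked_image: (1, 3, H, W) tensor on cuda, prompt_embeds: (1, L, D) tensor on cuda
    generator = torch.Generator("cuda").manual_seed(seed)

    # 1) VAE encode: masked human image -> latent embedding
    init_latents = vae.encode(masked_image).latent_dist.sample(generator) * vae.config.scaling_factor

    # 2) add noise, then denoise for num_steps conditioned on [image_embeds, prompt_embeds]
    scheduler.set_timesteps(num_steps, device="cuda")
    noise = torch.randn(init_latents.shape, generator=generator, device="cuda")
    latents = scheduler.add_noise(init_latents, noise, scheduler.timesteps[:1])
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, encoder_hidden_states=prompt_embeds).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # 3) VAE decode: latent -> image space, and that is the output image
    return vae.decode(latents / vae.config.scaling_factor).sample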