Text embedding

#24 by razcohen9

I am trying to manually implement the model for research purposes.
I have loaded all of the tokenizers and text encoders for the model.
I am performing the step-by-step encoding of the prompt described in Section B.2 of the paper.
I extract the pooled outputs (of sizes 768 and 1280) from the two CLIP text encoders and concatenate them.
I pass the concatenated tensor to the SD3Transformer2DModel as the `pooled_projections` argument.
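Concretely, this step of my code looks roughly like the following (the repo id, prompt, and the `encode_clip` helper are placeholders of mine; one detail worth noting is that the diffusers pipeline takes the per-token CLIP embeddings from the penultimate hidden layer, `hidden_states[-2]`, rather than from `last_hidden_state`):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# Placeholder checkpoint id; substitute whatever weights you loaded.
repo = "stabilityai/stable-diffusion-3-medium-diffusers"

tokenizer_1 = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder_1 = CLIPTextModelWithProjection.from_pretrained(repo, subfolder="text_encoder")
tokenizer_2 = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer_2")
text_encoder_2 = CLIPTextModelWithProjection.from_pretrained(repo, subfolder="text_encoder_2")

prompt = "a photo of a cat"  # placeholder prompt

def encode_clip(tokenizer, encoder, prompt):
    ids = tokenizer(prompt, padding="max_length", max_length=77,
                    truncation=True, return_tensors="pt").input_ids
    out = encoder(ids, output_hidden_states=True)
    # out[0] is the projected pooled embedding; the per-token embeddings
    # come from the penultimate hidden layer, as in the diffusers pipeline.
    return out[0], out.hidden_states[-2]

pooled_1, embeds_1 = encode_clip(tokenizer_1, text_encoder_1, prompt)  # [1, 768],  [1, 77, 768]
pooled_2, embeds_2 = encode_clip(tokenizer_2, text_encoder_2, prompt)  # [1, 1280], [1, 77, 1280]

pooled_projections = torch.cat([pooled_1, pooled_2], dim=-1)           # [1, 2048]
```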
I also extract the per-token encodings from the CLIP models, of shapes [77, 768] and [77, 1280], and concatenate them into the c_ctxt_clip tensor of shape [77, 2048] mentioned in the paper.
Then I zero-pad it along the channel dimension to shape [77, 4096].
I also use the T5 encoder to obtain c_ctxt_t5 of shape [77, 4096], and concatenate the two along the sequence axis to obtain c_ctxt of shape [154, 4096].
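A minimal sketch of that construction, reusing `embeds_1`/`embeds_2` from above (the variable names are mine; note also that the released diffusers pipeline pads the T5 tokens to 256 by default, whereas I follow the paper's 77 here):

```python
import torch
import torch.nn.functional as F
from transformers import T5TokenizerFast, T5EncoderModel

tokenizer_3 = T5TokenizerFast.from_pretrained(repo, subfolder="tokenizer_3")
text_encoder_3 = T5EncoderModel.from_pretrained(repo, subfolder="text_encoder_3")

# Concatenate the two CLIP per-token embeddings along the channel axis,
# then zero-pad the channel dimension from 2048 up to the T5 width of 4096.
c_ctxt_clip = torch.cat([embeds_1, embeds_2], dim=-1)                 # [1, 77, 2048]
c_ctxt_clip = F.pad(c_ctxt_clip, (0, 4096 - c_ctxt_clip.shape[-1]))  # [1, 77, 4096]

t5_ids = tokenizer_3(prompt, padding="max_length", max_length=77,
                     truncation=True, return_tensors="pt").input_ids
c_ctxt_t5 = text_encoder_3(t5_ids)[0]                                 # [1, 77, 4096]

# Concatenate along the sequence axis; this is the tensor I feed to the
# transformer as encoder_hidden_states.
c_ctxt = torch.cat([c_ctxt_clip, c_ctxt_t5], dim=-2)                  # [1, 154, 4096]
```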
This all seems pretty straightforward, but at the end of the denoising process, when I decode the result with the VAE, I get a 1024x1024 RGB image that doesn't make any sense (as in the image below).
Am I missing anything?
[attached image: image.png]
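For completeness, my decode step is roughly the following (a sketch, assuming `latents` holds the final denoised latents from my loop; one detail I'm double-checking is that the SD3 VAE config carries both a scaling_factor and a shift_factor that must be undone before decoding):

```python
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")

# The SD3 VAE config has both a scaling_factor and a shift_factor; the
# latents must be un-scaled and un-shifted before decoding.
latents = latents / vae.config.scaling_factor + vae.config.shift_factor
image = vae.decode(latents, return_dict=False)[0]  # values roughly in [-1, 1]
image = (image / 2 + 0.5).clamp(0, 1)              # map to [0, 1] for viewing
```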
