Order of the LayerNorm in T5 Model

#28
by dkarthikeyan1 - opened

Hi all,

Was just going through the T5 paper and noticed that the authors mention their LayerNorm placement differs from the Vaswani et al. 2017 AAYN paper: AAYN applies LayerNorm to the outputs of the multi-headed attention (MHA) and FFN sub-layers, giving LayerNorm(x + SubLayer(x)), whereas T5 applies it to the inputs of the MHA and FFN, so the residual connection becomes x + SubLayer(LayerNorm(x)). However, when I looked at the T5 model I noticed that the T5LayerNorm comes after the T5Attention. Is this just how the model architecture is printed, or a potential deviation from the paper?
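For concreteness, the two orderings can be sketched as follows. This is a minimal illustration, not the actual Hugging Face implementation: the sublayer is passed in as a function, and the LayerNorm here is the T5-style RMS variant (no mean subtraction, and the learned scale is omitted for brevity).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # T5-style "RMS" LayerNorm: rescale by root-mean-square only,
    # with no mean subtraction and no bias (learned scale omitted)
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def post_norm_block(x, sublayer):
    # AAYN (Vaswani et al. 2017): normalize the residual sum,
    # i.e. LayerNorm(x + SubLayer(x))
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # T5: normalize the sublayer *input*; the residual adds raw x,
    # i.e. x + SubLayer(LayerNorm(x))
    return x + sublayer(layer_norm(x))
```

One visible consequence: with a zero sublayer, the pre-norm block returns x unchanged, while the post-norm block still rescales x.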

Thanks!

dkarthikeyan1 changed discussion status to closed
