placeholder tokens are zero initialized
#89
by xdseunghyun · opened
Hi, thank you for sharing an awesome model!
I'm trying to fine-tune Phi-3 Medium, but the gradients become NaN right after the first optimization step.
I found that some placeholder tokens have zero-initialized embeddings, and this turned out to be the cause.
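For anyone who wants to check this themselves, here is a minimal sketch of how the zero rows can be spotted (assuming the `microsoft/Phi-3-medium-4k-instruct` checkpoint; adjust to whichever variant you use):

```python
# Minimal sketch for locating zero-initialized embedding rows.
# Checkpoint name is an assumption; any Phi-3 Medium variant works the same way.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-medium-4k-instruct",
    torch_dtype=torch.bfloat16,
)

# Input embedding matrix, shape (vocab_size, d_model).
emb = model.get_input_embeddings().weight

# A row that is exactly all zeros has zero absolute sum.
zero_ids = (emb.abs().sum(dim=-1) == 0).nonzero(as_tuple=True)[0]
print("zero-initialized token ids:", zero_ids.tolist())
```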
When I instead add another special token initialized with std 0.02 (roughly the conventional 1/sqrt(d_model) init) and train with that token, everything is fine.
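For reference, a sketch of that workaround, continuing from the snippet above (the token string `<my_token>` is just a placeholder name, not anything from the released tokenizer):

```python
# Sketch of the workaround: add a new special token and give its embedding
# row a normal(0, 0.02) init instead of zeros. <my_token> is hypothetical.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-medium-4k-instruct")
tokenizer.add_special_tokens({"additional_special_tokens": ["<my_token>"]})
model.resize_token_embeddings(len(tokenizer))  # model loaded as above

with torch.no_grad():
    new_id = tokenizer.convert_tokens_to_ids("<my_token>")
    model.get_input_embeddings().weight[new_id].normal_(mean=0.0, std=0.02)
```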
So here are my questions:
Can a zero embedding vector cause NaN gradients? And why are the placeholder tokens initialized to zero?