placeholder tokens are zero initialized
#89
by xdseunghyun · opened
Hi, thank you for sharing an awesome model!
I'm trying to fine-tune Phi-3 Medium, but the gradients become NaN right after the first optimization step.
I found that some placeholder tokens have zero-initialized embeddings, and this turned out to be the cause.
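For anyone who wants to check this themselves, here is a minimal sketch of how the zero rows can be spotted (assuming the `microsoft/Phi-3-medium-4k-instruct` checkpoint; adjust to whichever variant you use):

```python
# Minimal sketch for locating zero-initialized embedding rows.
# Checkpoint name is an assumption; any Phi-3 Medium variant works the same way.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-medium-4k-instruct",
    torch_dtype=torch.bfloat16,
)

# Input embedding matrix, shape (vocab_size, d_model).
emb = model.get_input_embeddings().weight

# A row that is exactly all zeros has zero absolute sum.
zero_ids = (emb.abs().sum(dim=-1) == 0).nonzero(as_tuple=True)[0]
print("zero-initialized token ids:", zero_ids.tolist())
```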
When I instead add another special token initialized with std 0.02 (roughly the conventional 1/sqrt(d_model) init) and train with that token, everything is fine.
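For reference, a sketch of that workaround, continuing from the snippet above (the token string `<my_token>` is just a placeholder name, not anything from the released tokenizer):

```python
# Sketch of the workaround: add a new special token and give its embedding
# row a normal(0, 0.02) init instead of zeros. <my_token> is hypothetical.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-medium-4k-instruct")
tokenizer.add_special_tokens({"additional_special_tokens": ["<my_token>"]})
model.resize_token_embeddings(len(tokenizer))  # model loaded as above

with torch.no_grad():
    new_id = tokenizer.convert_tokens_to_ids("<my_token>")
    model.get_input_embeddings().weight[new_id].normal_(mean=0.0, std=0.02)
```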
So here are my questions:
Can a zero embedding vector cause NaN gradients? And why are the placeholder tokens initialized to zero?