Activation function

#56
by aboros98 - opened

Hello!

Should the model use exact GELU or approximate (tanh) GELU as the gating function in the MLP?

I am asking because the PyTorch/HF implementation uses exact GELU, while the Keras and JAX implementations use the approximate GELU.
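
For concreteness, here is a minimal PyTorch sketch of the two variants I mean (illustrative only, not taken from the model code):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, steps=9)

# Exact GELU: x * Phi(x), computed with the Gaussian CDF (erf-based).
exact = F.gelu(x)

# Tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3)))
approx = F.gelu(x, approximate="tanh")

# The two agree closely but not exactly.
print((exact - approx).abs().max())
```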

Thanks!

Hi @aboros98
It should be approximate GELU, I think; see: https://twitter.com/danielhanchen/status/1763613620909580505
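
If you want to force the tanh variant on the PyTorch/HF side, something like the sketch below should work. The checkpoint id is a placeholder, and I am assuming the config exposes the MLP activation as `hidden_act`, as most HF transformer configs do; `"gelu_pytorch_tanh"` selects the tanh approximation, while plain `"gelu"` is the exact erf form.

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "org/model-name"  # hypothetical checkpoint id, substitute your own

config = AutoConfig.from_pretrained(model_id)
# Switch the MLP activation to the tanh-approximate GELU.
config.hidden_act = "gelu_pytorch_tanh"

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```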
