Mismatch in attention weights for causally masked tokens vs attention-masked tokens

by LakshyAAAgrawal

Copied from the original thread at https://github.com/salesforce/CodeGen/issues/49

The attention scores for tokens masked out via `attention_mask` are set to -1e4 (see https://github.com/salesforce/CodeGen/blob/main/jaxformer/hf/codegen/modeling_codegen.py#L439), whereas the scores masked out via the `causal_mask` are set to -1e9. The pre-softmax attention scores for causally masked tokens and for padded tokens therefore differ, which causes the inference outputs for a sequence run on its own to differ from the outputs for the same sequence run inside a padded batch.
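For concreteness, here is a minimal reproduction sketch (the model name and prompts are illustrative, not from the original report) that compares the logits of a sequence run alone against the same sequence inside a right-padded batch; if the two masking paths were exactly equivalent, the difference would be zero:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint and prompts (assumptions, not from the report)
name = "Salesforce/codegen-350M-mono"
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token  # CodeGen's tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(name).eval()

# The same prompt, once alone and once batched with a longer sequence,
# so the batched copy picks up padded (attention-masked) positions.
single = tok("def hello():", return_tensors="pt")
batch = tok(
    ["def hello():", "def add(a, b):\n    return a + b"],
    return_tensors="pt",
    padding=True,  # right-pads the shorter sequence
)

with torch.no_grad():
    logits_single = model(**single).logits[0, -1]
    last = single["input_ids"].shape[1] - 1  # last real token of sequence 0
    logits_batched = model(**batch).logits[0, last]

# Nonzero difference indicates the masking discrepancy described above
print((logits_single - logits_batched).abs().max())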
