Mismatch in attention weights for causally masked tokens vs attention-masked tokens

by LakshyAAAgrawal

Copied from the original thread at https://github.com/salesforce/CodeGen/issues/49

The attention scores for tokens masked out via `attention_mask` are set to -1e4 (see https://github.com/salesforce/CodeGen/blob/main/jaxformer/hf/codegen/modeling_codegen.py#L439), whereas the scores masked out via the `causal_mask` are set to -1e9. The pre-softmax attention scores for causally masked tokens and for padded tokens therefore differ, which causes the inference outputs for a sequence run on its own to differ from the outputs for the same sequence run inside a padded batch.
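For concreteness, here is a minimal reproduction sketch (the model name and prompts are illustrative, not from the original report) that compares the logits of a sequence run alone against the same sequence inside a right-padded batch; if the two masking paths were exactly equivalent, the difference would be zero:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint and prompts (assumptions, not from the report)
name = "Salesforce/codegen-350M-mono"
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token  # CodeGen's tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(name).eval()

# The same prompt, once alone and once batched with a longer sequence,
# so the batched copy picks up padded (attention-masked) positions.
single = tok("def hello():", return_tensors="pt")
batch = tok(
    ["def hello():", "def add(a, b):\n    return a + b"],
    return_tensors="pt",
    padding=True,  # right-pads the shorter sequence
)

with torch.no_grad():
    logits_single = model(**single).logits[0, -1]
    last = single["input_ids"].shape[1] - 1  # last real token of sequence 0
    logits_batched = model(**batch).logits[0, last]

# Nonzero difference indicates the masking discrepancy described above
print((logits_single - logits_batched).abs().max())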
