Continue the discussion: `long_factor` and `short_factor`

#32
by J22 - opened

Originally, `long_factor` was never used, since `inv_freq` was always initialized from `short_factor`. A PR (https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/25) was then merged to fix this.

But, unfortunately, the code is still confusing. Suppose this is called for the first time and 5000 tokens are passed in; then `long_factor` is also used for the first 4096 tokens. Is this intentional?

    

```python
    @torch.no_grad()
    def forward(self, x, position_ids, seq_len=None):
        seq_len = torch.max(position_ids) + 1
        if seq_len > self.original_max_position_embeddings:
            ext_factors = torch.tensor(self.long_factor, dtype=torch.float32, device=x.device)
        else:
            ext_factors = torch.tensor(self.short_factor, dtype=torch.float32, device=x.device)
```
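
To make the question concrete, here is a minimal standalone sketch of the same branch (the 4096 threshold and the placeholder factor values are assumptions for illustration, not taken from the modeling code): a single 5000-token prefill selects `long_factor` for the whole call, including positions 0..4095.

```python
import torch

# Assumed values for illustration only: 4096 stands in for
# original_max_position_embeddings, and the factor lists are placeholder strings.
original_max_position_embeddings = 4096
long_factor, short_factor = "long_factor", "short_factor"

# A single 5000-token prompt processed in one forward pass.
position_ids = torch.arange(5000).unsqueeze(0)
seq_len = torch.max(position_ids) + 1  # 5000

# Same branch as in forward(): one choice is made for the entire call,
# so even positions 0..4095 end up using long_factor here.
chosen = long_factor if seq_len > original_max_position_embeddings else short_factor
print(chosen)  # "long_factor"
```
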
Microsoft org

Yes! Comment from the authors:

When a batch contains long and short sequences, it will always use long factor, even for short samples.
Currently we don't support such mixed batches.
gugarosa changed discussion status to closed

So, theoretically, the short factor should be used for the first 4k tokens, while the long factor should be used for tokens after 4k.

"Mixed batches is not supported" is a limitation (if not a bug) of current implementation.

Then, does it have a negative impact on performance to "always use long factor" for a mixed batch?
