Continue the discussion: `long_factor` and `short_factor`
#32
by J22 · opened
Originally, `long_factor` was never used, since `inv_freq` was likely always initialized from `short_factor`. Then a PR (https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/25) was merged to fix it.
But, unfortunately, the code is still confusing. Suppose this is called for the first time and 5000 tokens are passed in; then `long_factor` is also used for the first 4096 tokens. Is this intentional?
```python
@torch.no_grad()
def forward(self, x, position_ids, seq_len=None):
    seq_len = torch.max(position_ids) + 1
    if seq_len > self.original_max_position_embeddings:
        ext_factors = torch.tensor(self.long_factor, dtype=torch.float32, device=x.device)
    else:
        ext_factors = torch.tensor(self.short_factor, dtype=torch.float32, device=x.device)
```
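To make the question concrete, here is a minimal standalone sketch of that check (placeholder factor values and an assumed count of 48 entries; the real values are in the model's config):

```python
import torch

# Standalone sketch: the check compares the max position of the *current call*
# against the 4096-token original context, so a single 5000-token prefill
# selects long_factor for every position, including the first 4096.
original_max_position_embeddings = 4096
short_factor = [1.0] * 48   # placeholders; real values live in config.json
long_factor = [1.5] * 48    # placeholders

position_ids = torch.arange(5000).unsqueeze(0)   # first forward pass, 5000 tokens
seq_len = torch.max(position_ids) + 1

if seq_len > original_max_position_embeddings:
    ext_factors = torch.tensor(long_factor, dtype=torch.float32)
else:
    ext_factors = torch.tensor(short_factor, dtype=torch.float32)

print(int(seq_len), ext_factors[0].item())   # 5000 1.5 -> long factor applied to all positions
```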
Yes! Comment from the authors:

> When a batch contains long and short sequences, it will always use long factor, even for short samples. Currently we don't support such mixed batches.
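A small batched sketch of why that happens with the code above (padding details simplified, same placeholder numbers as before):

```python
import torch

# Mixed-batch case: torch.max(position_ids) is taken over the whole batch, so one
# long sequence pushes every sequence in the batch onto long_factor, including the
# short one.
original_max_position_embeddings = 4096

long_ids = torch.arange(5000)                                                     # 5000-token sequence
short_ids = torch.cat([torch.arange(100), torch.zeros(4900, dtype=torch.long)])   # 100 tokens, right-padded
position_ids = torch.stack([long_ids, short_ids])                                 # batch of 2

seq_len = torch.max(position_ids) + 1
print(bool(seq_len > original_max_position_embeddings))   # True -> long factor for both sequences
```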
gugarosa changed discussion status to closed
So, theoretically, for the first 4k tokens the short factor should be used, while for tokens after 4k the long factor should be used.
"Mixed batches are not supported" is a limitation (if not a bug) of the current implementation.
Then, does it have a negative impact on performance if the long factor is always used for a mixed batch?
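Purely as an illustration of the "theoretically" case above, a hypothetical per-position selection could look like the sketch below (not the released implementation; placeholder factor values as before):

```python
import torch

# Hypothetical per-position selection: positions below the original 4096-token
# window would use short_factor, positions at or beyond it would use long_factor.
original_max_position_embeddings = 4096
short_factor = torch.full((48,), 1.0)   # placeholders
long_factor = torch.full((48,), 1.5)    # placeholders

position_ids = torch.arange(5000)
use_long = (position_ids >= original_max_position_embeddings).unsqueeze(-1)   # (5000, 1) mask
ext_factors = torch.where(use_long, long_factor, short_factor)                # (5000, 48)

print(ext_factors[0, 0].item(), ext_factors[4096, 0].item())   # 1.0 1.5
```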