Continue the discussion: `long_factor` and `short_factor`
#32
by J22 · opened
Originally, `long_factor` was never used, since `inv_freq` was likely always initialized from `short_factor`. Then a PR (https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/25) was merged to fix it.
But, unfortunately, the code is still confusing. Suppose this is called for the first time and 5000 tokens are passed in; then `long_factor` is also used for the first 4096 tokens. Is this intentional?
```python
@torch.no_grad()
def forward(self, x, position_ids, seq_len=None):
    seq_len = torch.max(position_ids) + 1
    if seq_len > self.original_max_position_embeddings:
        ext_factors = torch.tensor(self.long_factor, dtype=torch.float32, device=x.device)
    else:
        ext_factors = torch.tensor(self.short_factor, dtype=torch.float32, device=x.device)
```
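To make the question concrete, here is a minimal standalone sketch of that check (placeholder factor values and an assumed count of 48 entries; the real values are in the model's config):

```python
import torch

# Standalone sketch: the check compares the max position of the *current call*
# against the 4096-token original context, so a single 5000-token prefill
# selects long_factor for every position, including the first 4096.
original_max_position_embeddings = 4096
short_factor = [1.0] * 48   # placeholders; real values live in config.json
long_factor = [1.5] * 48    # placeholders

position_ids = torch.arange(5000).unsqueeze(0)   # first forward pass, 5000 tokens
seq_len = torch.max(position_ids) + 1

if seq_len > original_max_position_embeddings:
    ext_factors = torch.tensor(long_factor, dtype=torch.float32)
else:
    ext_factors = torch.tensor(short_factor, dtype=torch.float32)

print(int(seq_len), ext_factors[0].item())   # 5000 1.5 -> long factor applied to all positions
```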
Yes! Comment from the authors:

> When a batch contains long and short sequences, it will always use long factor, even for short samples. Currently we don't support such mixed batches.
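A small batched sketch of why that happens with the code above (padding details simplified, same placeholder numbers as before):

```python
import torch

# Mixed-batch case: torch.max(position_ids) is taken over the whole batch, so one
# long sequence pushes every sequence in the batch onto long_factor, including the
# short one.
original_max_position_embeddings = 4096

long_ids = torch.arange(5000)                                                     # 5000-token sequence
short_ids = torch.cat([torch.arange(100), torch.zeros(4900, dtype=torch.long)])   # 100 tokens, right-padded
position_ids = torch.stack([long_ids, short_ids])                                 # batch of 2

seq_len = torch.max(position_ids) + 1
print(bool(seq_len > original_max_position_embeddings))   # True -> long factor for both sequences
```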
gugarosa changed discussion status to closed
So, theoretically, for the first 4k tokens the short factor should be used, while for tokens after 4k the long factor should be used.
"Mixed batches are not supported" is a limitation (if not a bug) of the current implementation.
Then, does it have a negative impact on performance if the long factor is always used for a mixed batch?
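Purely as an illustration of the "theoretically" case above, a hypothetical per-position selection could look like the sketch below (not the released implementation; placeholder factor values as before):

```python
import torch

# Hypothetical per-position selection: positions below the original 4096-token
# window would use short_factor, positions at or beyond it would use long_factor.
original_max_position_embeddings = 4096
short_factor = torch.full((48,), 1.0)   # placeholders
long_factor = torch.full((48,), 1.5)    # placeholders

position_ids = torch.arange(5000)
use_long = (position_ids >= original_max_position_embeddings).unsqueeze(-1)   # (5000, 1) mask
ext_factors = torch.where(use_long, long_factor, short_factor)                # (5000, 48)

print(ext_factors[0, 0].item(), ext_factors[4096, 0].item())   # 1.0 1.5
```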