Mistral sliding_window implementation and flash_attn_func

#154
by SadRick - opened

I am trying to fine-tune Mistral-7B using the Hugging Face Trainer and Flash Attention, and I have observed strange behaviour with sliding_window: changing its size has no effect on training at all. I assumed that by reducing the sliding_window size I might be able to fit longer sequences into the model, but neither VRAM usage nor training time seems to be affected by the sliding_window size.
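For context, my setup looks roughly like this (a minimal sketch, not my exact SageMaker training script; the model name, window size and dtype are just example values):

```python
# Sketch: load Mistral-7B with Flash Attention 2 and an overridden sliding_window.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
config.sliding_window = 512  # the value I varied (1, 512, 2048)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    config=config,
    torch_dtype=torch.bfloat16,
    # on older transformers versions this kwarg is use_flash_attention_2=True
    attn_implementation="flash_attention_2",
)
print(model.config.sliding_window)  # confirms the override is in place
```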

sliding_window in the transformers library

I added some checks to the MistralFlashAttention2 class (tested with a sequence length of 2048 and sliding_window sizes of 1, 512 and 2048), and they suggested that the sliding_window setting is passed through correctly:

  • the use_sliding_window flag was True
  • the sliding_window size was displayed correctly
  • flash_attn_func was called with the sliding_window parameter (see the sketch after this list)
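To illustrate that last point, the call I verified boils down to something like this (a standalone sketch with arbitrary shapes, not the actual transformers code; run on a CUDA GPU):

```python
# Sketch: flash_attn_func with a finite sliding window.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 1, 2048, 32, 128
sliding_window = 512

q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# window_size=(-1, -1) would mean full attention; a finite window restricts
# each query to the most recent `sliding_window` keys.
out = flash_attn_func(
    q, k, v,
    causal=True,
    window_size=(sliding_window, sliding_window),
)
print(out.shape)  # (batch, seqlen, nheads, headdim)
```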

Memory usage and training time in SageMaker

I measured the peak GPU memory allocated for each training step and the time taken by each step:
(example)
Peak GPU memory allocated at step 92: 8239454208 bytes
Step 92 took 472.382896900177 seconds.
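For reference, the measurement itself is roughly this (a simplified sketch of a Trainer callback, not my exact logging code):

```python
# Sketch: log peak GPU memory and wall-clock time per training step.
import time
import torch
from transformers import TrainerCallback

class StepStatsCallback(TrainerCallback):
    def on_step_begin(self, args, state, control, **kwargs):
        torch.cuda.reset_peak_memory_stats()
        self._t0 = time.time()

    def on_step_end(self, args, state, control, **kwargs):
        peak = torch.cuda.max_memory_allocated()
        print(f"Peak GPU memory allocated at step {state.global_step}: {peak} bytes")
        print(f"Step {state.global_step} took {time.time() - self._t0} seconds.")
```

I pass this to the Trainer via callbacks=[StepStatsCallback()].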

These measurements did not change regardless of what sliding_window was set to (the system logs on wandb show the same). This seems odd to me; can someone help me understand this behaviour?
