Anthonyg5005 committed
Commit 9307bce · 1 Parent(s): 52333b0

add optional flash-attn


also added default no to delete model

auto-exl2-upload/INSTRUCTIONS.txt CHANGED
@@ -8,7 +8,7 @@ https://developer.nvidia.com/cuda-11-8-0-download-archive
 
 Restart your computer after installing the CUDA toolkit to make sure the PATH is set correctly.
 
-Haven't done much testing but for Windows, Visual Studio 2019 with desktop development for C++ might be required.
+Visual Studio with desktop development for C++ is required.
 https://visualstudio.microsoft.com/thank-you-downloading-visual-studio/?sku=community&rel=16&utm_medium=microsoft&utm_campaign=download+from+relnotes&utm_content=vs2019ga+button
 install the desktop development for C++ workload
 
@@ -19,11 +19,11 @@ For example, on Ubuntu use: sudo apt-get install build-essential
 
 This may work with AMD cards but only on linux and possibly WSL2. I can't guarantee that it will work on AMD cards, I personally don't have one to test with. You may need to install stuff before starting. https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html
 
-Only python 3.8 - 3.11 is known to work. If you have a higher version of python, I can't guarantee that it will work.
+Only python 3.8 - 3.12 is known to work. If you have a higher/lower version of python, I can't guarantee that it will work.
 
 
 
-First setup your environment by using either windows.bat or linux.sh. If something fails during setup, then delete venv folder and try again.
+First setup your environment by using either windows.bat or linux.sh.
 
 After setup is complete then you'll have a file called start-quant. Use this to run the quant script.
 
@@ -32,7 +32,7 @@ Make sure to also have a lot of RAM depending on the model. Have noticed gemma t
 
 If you close the terminal or the terminal crashes, check the last BPW it was on and enter the remaining quants you wanted. It should be able to pick up where it left off. Don't type the BPW of completed quants as it will start from the beginning. You may also use ctrl + c to pause at any time during the quant process.
 
-To add more options to the quantization process, you can add them to line 174. All options: https://github.com/turboderp/exllamav2/blob/master/doc/convert.md
+To add more options to the quantization process, you can add them to line 189. All options: https://github.com/turboderp/exllamav2/blob/master/doc/convert.md
 
 Things may break in the future as it downloads the latest version of all the dependencies which may either change names or how they work. If something breaks, please open a discussion at https://huggingface.co/Anthonyg5005/hf-scripts/discussions
 
@@ -46,4 +46,4 @@ https://github.com/oobabooga
 Credit to Lucain Pouget for maintaining huggingface-hub.
 https://github.com/Wauplin
 
-Only tested with CUDA 12.1 on Windows 11
+Only tested with CUDA 12.1 on Windows 11 and WSL2 Ubuntu 24.04
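The "add more options" note refers to extending the exllamav2 convert invocation inside exl2-quant.py. A minimal sketch of the idea, assuming a subprocess-style command list (the paths and flag values here are illustrative, not the script's actual ones; consult doc/convert.md for the real option set):

```python
# Hypothetical sketch: extra convert.md flags are appended to the command list
# before it is executed. "-hb 8" (head bits) is one example option.
base_cmd = ["venv/bin/python", "exllamav2/convert.py",
            "-i", "models/some-model", "-cf", "models/some-model-exl2", "-b", "4.0"]
extra_options = ["-hb", "8"]  # any flags from exllamav2's doc/convert.md
cmd = base_cmd + extra_options
```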
auto-exl2-upload/exl2-quant.py CHANGED
@@ -114,9 +114,13 @@ while priv2pub != 'y' and priv2pub != 'n':
 clear_screen()
 
 #ask to delete original fp16 weights
-delmodel = input("Do you want to delete the original model? (Won't delete if paused or failed) (y/n): ")
+delmodel = input("Do you want to delete the original model? (Won't delete if paused or failed) (y/N): ")
+if delmodel == '':
+    delmodel = 'n'
 while delmodel != 'y' and delmodel != 'n':
     delmodel = input("Please enter 'y' or 'n': ")
+    if delmodel == '':
+        delmodel = 'n'
 clear_screen()
 
 #downloading the model
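The change above makes an empty answer default to 'n'. The same pattern can be factored into a small helper; a minimal sketch, assuming nothing about the script beyond the prompt behavior (the `ask_yes_no` name and the injectable `read` parameter are mine, added for testability):

```python
def ask_yes_no(prompt, default='n', read=input):
    """Ask a y/n question; an empty answer falls back to the default
    ('n' here, matching the script's new y/N behavior)."""
    answer = read(prompt).strip().lower()
    while True:
        if answer == '':
            return default
        if answer in ('y', 'n'):
            return answer
        answer = read("Please enter 'y' or 'n': ").strip().lower()
```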
auto-exl2-upload/linux-setup.sh CHANGED
@@ -4,11 +4,15 @@
 
 # check if "venv" subdirectory exists, if not, create one
 if [ ! -d "venv" ]; then
-    python3 -m venv venv
+    python -m venv venv
 else
-    echo "venv directory already exists. If something is broken, delete venv folder and run this script again."
-    read -p "Press enter to continue"
-    exit
+    read -p "venv directory already exists. Looking to upgrade/reinstall exllama? (will reinstall python venv) (y/n) " reinst
+    if [ "$reinst" = "y" ]; then
+        rm -rf venv
+        python -m venv venv
+    else
+        exit
+    fi
 fi
 
 # ask if the user has git installed
@@ -17,7 +21,9 @@ read -p "Do you have git and wget installed? (y/n) " gitwget
 if [ "$gitwget" = "y" ]; then
     echo "Setting up environment"
 else
-    echo "Please install git and wget before running this script."
+    echo "Please install git and wget from your distro's package manager before running this script."
+    echo "Example for Debian-based: sudo apt-get install git wget"
+    echo "Example for Arch-based: sudo pacman -S git wget"
     read -p "Press enter to continue"
     exit
 fi
@@ -33,6 +39,15 @@ fi
 # if CUDA version 12 install pytorch for 12.1, else if CUDA 11 install pytorch for 11.8. If ROCm, install pytorch for ROCm 5.7
 read -p "Please enter your GPU compute version, CUDA 11/12 or AMD ROCm (11, 12, rocm): " pytorch_version
 
+# ask to install flash attention
+echo "Flash attention is a feature that could fix overflow issues on some more broken models."
+read -p "Would you like to install flash-attention? (rarely needed and optional) (y/n) " flash_attention
+if [ "$flash_attention" != "y" ] && [ "$flash_attention" != "n" ]; then
+    echo "Invalid input. Please enter y or n."
+    read -p "Press enter to continue"
+    exit
+fi
+
 if [ "$pytorch_version" = "11" ]; then
     echo "Installing PyTorch for CUDA 11.8"
     venv/bin/python -m pip install torch --index-url https://download.pytorch.org/whl/cu118 --upgrade
@@ -54,6 +69,7 @@ rm download-model.py
 rm -rf exllamav2
 rm start-quant.sh
 rm enter-venv.sh
+rm -rf flash-attention
 
 # download stuff
 echo "Downloading files"
@@ -71,6 +87,14 @@ venv/bin/python -m pip install -r exllamav2/requirements.txt
 venv/bin/python -m pip install huggingface-hub transformers accelerate
 venv/bin/python -m pip install ./exllamav2
 
+if [ "$flash_attention" = "y" ]; then
+    echo "Installing flash-attention..."
+    echo "If failed, retry without flash-attention."
+    git clone https://github.com/Dao-AILab/flash-attention
+    venv/bin/python -m pip install ./flash-attention
+    rm -rf flash-attention
+fi
+
 # create start-quant.sh
 echo "#!/bin/bash" > start-quant.sh
 echo "venv/bin/python exl2-quant.py" >> start-quant.sh
@@ -86,4 +110,4 @@ chmod +x enter-venv.sh
 echo "If you use ctrl+c to stop, you may need to also use 'pkill python' to stop running scripts."
 echo "Environment setup complete. run start-quant.sh to start the quantization process."
 read -p "Press enter to exit"
-exit
+exit
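The flash-attention flow added above follows one shape: validate the y/n answer up front (before the long PyTorch install), then only clone and build at the end if the user opted in. A Python sketch of that decision logic, using a hypothetical helper that just returns the commands the script would run:

```python
FLASH_ATTN_REPO = "https://github.com/Dao-AILab/flash-attention"

def flash_attn_commands(choice, python_bin="venv/bin/python"):
    """Return the commands the setup would run for the optional
    flash-attention step; [] when the user declines.
    (Helper is illustrative; the real script inlines this in shell.)"""
    if choice not in ('y', 'n'):
        # the real script prints an error and exits here
        raise ValueError("Invalid input. Please enter y or n.")
    if choice == 'n':
        return []
    return [
        f"git clone {FLASH_ATTN_REPO}",
        f"{python_bin} -m pip install ./flash-attention",
        "rm -rf flash-attention",
    ]
```

Validating before the slow steps means a typo costs seconds, not the hours a flash-attention build can take.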
auto-exl2-upload/windows-setup.bat CHANGED
@@ -6,8 +6,12 @@ REM check if "venv" subdirectory exists, if not, create one
 if not exist "venv\" (
     python -m venv venv
 ) else (
-    echo venv directory already exists. If something is broken, delete everything but exl2-quant.py and run this script again.
-    pause
+    set /p reinst="venv directory already exists. Looking to upgrade/reinstall exllama? (will reinstall python venv) (y/n) "
+)
+if "%reinst%"=="y" (
+    rmdir /s /q venv
+    python -m venv venv
+) else (
     exit
 )
 
@@ -36,6 +40,15 @@ echo CUDA compilers:
 where nvcc
 set /p cuda_version="Please enter your CUDA version (11 or 12): "
 
+REM ask to install flash attention
+echo Flash attention is a feature that could fix overflow issues on some more broken models. However it will increase install time by a few hours.
+set /p flash_attention="Would you like to install flash-attention? (rarely needed and optional) (y/n) "
+if not "%flash_attention%"=="y" if not "%flash_attention%"=="n" (
+    echo Invalid input. Please enter y or n.
+    pause
+    exit
+)
+
 if "%cuda_version%"=="11" (
     echo Installing PyTorch for CUDA 11.8...
     venv\scripts\python.exe -m pip install torch --index-url https://download.pytorch.org/whl/cu118 --upgrade
@@ -48,13 +61,13 @@ if "%cuda_version%"=="11" (
     exit
 )
 
-
 echo Deleting potential conflicting files
 del convert-to-safetensors.py
 del download-model.py
 rmdir /s /q exllamav2
 del start-quant.sh
 del enter-venv.sh
+rmdir /s /q flash-attention
 
 REM download stuff
 echo Downloading files...
@@ -72,6 +85,14 @@ venv\scripts\python.exe -m pip install -r exllamav2/requirements.txt
 venv\scripts\python.exe -m pip install huggingface-hub transformers accelerate
 venv\scripts\python.exe -m pip install .\exllamav2
 
+if "%flash_attention%"=="y" (
+    echo Installing flash-attention. Go watch some movies, this will take a while...
+    echo If failed, retry without flash-attention.
+    git clone https://github.com/Dao-AILab/flash-attention
+    venv\scripts\python.exe -m pip install .\flash-attention
+    rmdir /s /q flash-attention
+)
+
 REM create start-quant-windows.bat
 echo @echo off > start-quant.bat
 echo venv\scripts\python.exe exl2-quant.py >> start-quant.bat
exl2-multi-quant-local/INSTRUCTIONS.txt CHANGED
@@ -1,14 +1,14 @@
 For NVIDIA cards install the CUDA toolkit
 
 Nvidia Maxwell or higher
-https://developer.nvidia.com/cuda-downloads
+https://developer.nvidia.com/cuda-12-1-0-download-archive
 
 Nvidia Kepler or higher
 https://developer.nvidia.com/cuda-11-8-0-download-archive
 
 Restart your computer after installing the CUDA toolkit to make sure the PATH is set correctly.
 
-Haven't done much testing but for Windows, Visual Studio 2019 with desktop development for C++ might be required.
+Visual Studio with desktop development for C++ is required.
 https://visualstudio.microsoft.com/thank-you-downloading-visual-studio/?sku=community&rel=16&utm_medium=microsoft&utm_campaign=download+from+relnotes&utm_content=vs2019ga+button
 install the desktop development for C++ workload
 
@@ -19,11 +19,11 @@ For example, on Ubuntu use: sudo apt-get install build-essential
 
 This may work with AMD cards but only on linux and possibly WSL2. I can't guarantee that it will work on AMD cards, I personally don't have one to test with. You may need to install stuff before starting. https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html
 
-Only python 3.8 - 3.11 is known to work. If you have a higher version of python, I can't guarantee that it will work.
+Only python 3.8 - 3.12 is known to work. If you have a higher/lower version of python, I can't guarantee that it will work.
 
 
 
-First setup your environment by using either windows.bat or linux.sh. If something fails during setup, then delete venv folder and try again.
+First setup your environment by using either windows.bat or linux.sh.
 
 After setup is complete then you'll have a file called start-quant. Use this to run the quant script.
 
@@ -32,7 +32,7 @@ Make sure to also have a lot of RAM depending on the model. Have noticed gemma t
 
 If you close the terminal or the terminal crashes, check the last BPW it was on and enter the remaining quants you wanted. It should be able to pick up where it left off. Don't type the BPW of completed quants as it will start from the beginning. You may also use ctrl + c to pause at any time during the quant process.
 
-To add more options to the quantization process, you can add them to line 136. All options: https://github.com/turboderp/exllamav2/blob/master/doc/convert.md
+To add more options to the quantization process, you can add them to line 140. All options: https://github.com/turboderp/exllamav2/blob/master/doc/convert.md
 
 Things may break in the future as it downloads the latest version of all the dependencies which may either change names or how they work. If something breaks, please open a discussion at https://huggingface.co/Anthonyg5005/hf-scripts/discussions
 
@@ -46,4 +46,4 @@ https://github.com/oobabooga
 Credit to Lucain Pouget for maintaining huggingface-hub.
 https://github.com/Wauplin
 
-Only tested with CUDA 12.1 on Windows 11
+Only tested with CUDA 12.1 on Windows 11 and WSL2 Ubuntu 24.04
exl2-multi-quant-local/exl2-quant.py CHANGED
@@ -85,9 +85,13 @@ bpwvalue = list(qnum.values())
 bpwvalue.sort()
 
 #ask to delete fp16 after done
-delmodel = input("Do you want to delete the original model after finishing? (Won't delete if paused or failed) (y/n): ")
+delmodel = input("Do you want to delete the original model? (Won't delete if paused or failed) (y/N): ")
+if delmodel == '':
+    delmodel = 'n'
 while delmodel != 'y' and delmodel != 'n':
     delmodel = input("Please enter 'y' or 'n': ")
+    if delmodel == '':
+        delmodel = 'n'
 if delmodel == 'y':
     print(f"Deleting dir models/{model} after quants are finished.")
     time.sleep(3)
exl2-multi-quant-local/linux-setup.sh CHANGED
@@ -4,11 +4,15 @@
 
 # check if "venv" subdirectory exists, if not, create one
 if [ ! -d "venv" ]; then
-    python3 -m venv venv
+    python -m venv venv
 else
-    echo "venv directory already exists. If something is broken, delete everything but exl2-quant.py and run this script again."
-    read -p "Press enter to continue"
-    exit
+    read -p "venv directory already exists. Looking to upgrade/reinstall exllama? (will reinstall python venv) (y/n) " reinst
+    if [ "$reinst" = "y" ]; then
+        rm -rf venv
+        python -m venv venv
+    else
+        exit
+    fi
 fi
 
 # ask if the user has git installed
@@ -17,7 +21,9 @@ read -p "Do you have git and wget installed? (y/n) " gitwget
 if [ "$gitwget" = "y" ]; then
     echo "Setting up environment"
 else
-    echo "Please install git and wget before running this script."
+    echo "Please install git and wget from your distro's package manager before running this script."
+    echo "Example for Debian-based: sudo apt-get install git wget"
+    echo "Example for Arch-based: sudo pacman -S git wget"
     read -p "Press enter to continue"
     exit
 fi
@@ -33,6 +39,15 @@ fi
 # if CUDA version 12 install pytorch for 12.1, else if CUDA 11 install pytorch for 11.8. If ROCm, install pytorch for ROCm 5.7
 read -p "Please enter your GPU compute version, CUDA 11/12 or AMD ROCm (11, 12, rocm): " pytorch_version
 
+# ask to install flash attention
+echo "Flash attention is a feature that could fix overflow issues on some more broken models."
+read -p "Would you like to install flash-attention? (rarely needed and optional) (y/n) " flash_attention
+if [ "$flash_attention" != "y" ] && [ "$flash_attention" != "n" ]; then
+    echo "Invalid input. Please enter y or n."
+    read -p "Press enter to continue"
+    exit
+fi
+
 if [ "$pytorch_version" = "11" ]; then
     echo "Installing PyTorch for CUDA 11.8"
     venv/bin/python -m pip install torch --index-url https://download.pytorch.org/whl/cu118 --upgrade
@@ -54,6 +69,7 @@ rm download-model.py
 rm -rf exllamav2
 rm start-quant.sh
 rm enter-venv.sh
+rm -rf flash-attention
 
 # download stuff
 echo "Downloading files"
@@ -71,6 +87,14 @@ venv/bin/python -m pip install -r exllamav2/requirements.txt
 venv/bin/python -m pip install huggingface-hub transformers accelerate
 venv/bin/python -m pip install ./exllamav2
 
+if [ "$flash_attention" = "y" ]; then
+    echo "Installing flash-attention..."
+    echo "If failed, retry without flash-attention."
+    git clone https://github.com/Dao-AILab/flash-attention
+    venv/bin/python -m pip install ./flash-attention
+    rm -rf flash-attention
+fi
+
 # create start-quant.sh
 echo "#!/bin/bash" > start-quant.sh
 echo "venv/bin/python exl2-quant.py" >> start-quant.sh
exl2-multi-quant-local/windows-setup.bat CHANGED
@@ -6,8 +6,12 @@ REM check if "venv" subdirectory exists, if not, create one
 if not exist "venv\" (
     python -m venv venv
 ) else (
-    echo venv directory already exists. If something is broken, delete everything but exl2-quant.py and run this script again.
-    pause
+    set /p reinst="venv directory already exists. Looking to upgrade/reinstall exllama? (will reinstall python venv) (y/n) "
+)
+if "%reinst%"=="y" (
+    rmdir /s /q venv
+    python -m venv venv
+) else (
     exit
 )
 
@@ -36,6 +40,15 @@ echo CUDA compilers:
 where nvcc
 set /p cuda_version="Please enter your CUDA version (11 or 12): "
 
+REM ask to install flash attention
+echo Flash attention is a feature that could fix overflow issues on some more broken models. However it will increase install time by a few hours.
+set /p flash_attention="Would you like to install flash-attention? (rarely needed and optional) (y/n) "
+if not "%flash_attention%"=="y" if not "%flash_attention%"=="n" (
+    echo Invalid input. Please enter y or n.
+    pause
+    exit
+)
+
 if "%cuda_version%"=="11" (
     echo Installing PyTorch for CUDA 11.8...
     venv\scripts\python.exe -m pip install torch --index-url https://download.pytorch.org/whl/cu118 --upgrade
@@ -54,6 +67,7 @@ del download-model.py
 rmdir /s /q exllamav2
 del start-quant.sh
 del enter-venv.sh
+rmdir /s /q flash-attention
 
 REM download stuff
 echo Downloading files...
@@ -71,6 +85,14 @@ venv\scripts\python.exe -m pip install -r exllamav2/requirements.txt
 venv\scripts\python.exe -m pip install huggingface-hub transformers accelerate
 venv\scripts\python.exe -m pip install .\exllamav2
 
+if "%flash_attention%"=="y" (
+    echo Installing flash-attention. Go watch some movies, this will take a while...
+    echo If failed, retry without flash-attention.
+    git clone https://github.com/Dao-AILab/flash-attention
+    venv\scripts\python.exe -m pip install .\flash-attention
+    rmdir /s /q flash-attention
+)
+
 REM create start-quant-windows.bat
 echo @echo off > start-quant.bat
 echo venv\scripts\python.exe exl2-quant.py >> start-quant.bat