keitokei1994 committed
Commit 250db7a (verified) · Parent: 2148ff1

Update README.md

Files changed (1):
1. README.md (+18, -2)
README.md CHANGED
@@ -6,7 +6,7 @@ language:
   - ja
   - en
   ---
- # shisa-v1-qwen2-7b-gguf
+ # shisa-v1-qwen2-7b-gguf (English explanation is below.)
  A gguf-format conversion of [shisa-v1-qwen2-7b, published by shisa-ai](https://huggingface.co/shisa-ai/shisa-v1-qwen2-7b).
 
  # Notice
@@ -20,4 +20,20 @@ language:
  2. Run with FlashAttention enabled, using a command like the following:
  ```
  ./server -m ./models/shisa-v1-qwen2-7b.Q8_0.gguf -ngl 99 --port 8888 -fa
- ```
+ ```
+
+ # shisa-v1-qwen2-7b-gguf
+ This is a gguf-format conversion of [shisa-v1-qwen2-7b](https://huggingface.co/shisa-ai/shisa-v1-qwen2-7b) published by shisa-ai.
+
+ # Notice
+ * Currently, there is a bug where the output gets corrupted when running models based on the qwen2-7B series in GGUF format. This can be avoided by enabling Flash Attention.
+ * If using LMStudio, please enable Flash Attention from the Preset.
+ * If using Llama.cpp, please follow these steps:
+ 1. Build with the following command:
+ ```
+ make LLAMA_CUDA=1 LLAMA_CUDA_FA_ALL_QUANTS=true
+ ```
+ 2. Run with Flash Attention enabled using a command like this:
+ ```
+ ./server -m ./models/shisa-v1-qwen2-7b.Q8_0.gguf -ngl 99 --port 8888 -fa
+ ```
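
Once the server is running, a quick way to confirm that generation is no longer corrupted is to send it a completion request. The sketch below is only an illustration: it assumes the launch command above (server listening on localhost:8888) and uses the llama.cpp server's native `/completion` endpoint; the prompt text and `n_predict` value are arbitrary examples.

```
# Minimal smoke test, assuming the server started above is listening on
# localhost:8888. /completion with "prompt" and "n_predict" is the
# llama.cpp server's native API; the prompt itself is illustrative.
curl -s http://localhost:8888/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Briefly introduce the city of Kyoto.", "n_predict": 128}'
```

If the returned text is coherent rather than garbled, FlashAttention is active and the qwen2-7B output-corruption issue is being avoided.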