JustinLin610 committed 616cc3f (parent: d448a78)

Update README.md

README.md CHANGED
@@ -43,9 +43,11 @@ To run Qwen2, you can use `llama-cli` (the previous `main`) or `llama-server` (t
 We recommend using the `llama-server` as it is simple and compatible with OpenAI API. For example:
 
 ```bash
-./llama-server -m qwen2-0.5b-instruct-q5_k_m.gguf
+./llama-server -m qwen2-0.5b-instruct-q5_k_m.gguf -ngl 24 -fa
 ```
 
+(Note: `-ngl 24` refers to offloading 24 layers to GPUs, and `-fa` refers to the use of flash attention.)
+
 Then it is easy to access the deployed service with OpenAI API:
 
 ```python
@@ -69,7 +71,11 @@ print(completion.choices[0].message.content)
 
 If you choose to use `llama-cli`, pay attention to the removal of `-cml` for the ChatML template. Instead you should use `--in-prefix` and `--in-suffix` to tackle this problem.
 
 ```bash
-./llama-cli -m qwen2-0.5b-instruct-q5_k_m.gguf
+./llama-cli -m qwen2-0.5b-instruct-q5_k_m.gguf \
+  -n 512 -co -i -if -f prompts/chat-with-qwen.txt \
+  --in-prefix "<|im_start|>user\n" \
+  --in-suffix "<|im_end|>\n<|im_start|>assistant\n" \
+  -ngl 24 -fa
 ```
 
 ## Citation
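As a minimal sketch of why the diff swaps `-cml` for explicit `--in-prefix`/`--in-suffix` flags (the function name `wrap_turn` is illustrative, not part of the commit): those two strings simply wrap each interactive user turn in the ChatML template that `-cml` used to apply automatically.

```python
# Illustrative sketch (not part of the commit): how the --in-prefix and
# --in-suffix values in the diff above wrap one interactive user turn into
# the ChatML format the removed -cml flag used to produce.
IN_PREFIX = "<|im_start|>user\n"                    # value of --in-prefix
IN_SUFFIX = "<|im_end|>\n<|im_start|>assistant\n"   # value of --in-suffix

def wrap_turn(user_message: str) -> str:
    """Build the text the CLI feeds the model for a single user turn."""
    return IN_PREFIX + user_message + IN_SUFFIX

print(wrap_turn("Hello"))
```

The resulting string opens a `user` turn, closes it, and opens an `assistant` turn, so the model's generation continues as the assistant reply.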