feihu.hf committed
Commit 54e5483 · 1 Parent(s): f346172

update README.md

Files changed (1)
  1. README.md +17 -17
README.md CHANGED
@@ -79,22 +79,22 @@ For deployment, we recommend using vLLM. You can enable long-context capabilitie
  1. **Install vLLM**: Ensure you have the latest version from the main branch of [vLLM](https://github.com/vllm-project/vllm).

  2. **Configure Model Settings**: After downloading the model weights, modify the `config.json` file by including the below snippet:
- ```json5
  {
- "architectures": [
- "Qwen2ForCausalLM"
- ],
- // ...
- "vocab_size": 152064,
-
- // adding the following snippets
- "rope_scaling": {
- "factor": 4.0,
- "original_max_position_embeddings": 32768,
- "type": "yarn"
  }
- }
- ```
  This snippet enable YARN to support longer contexts.

  3. **Model Deployment**: Utilize vLLM to deploy your model. For instance, you can set up an openAI-like server using the command:
@@ -111,15 +111,15 @@ For deployment, we recommend using vLLM. You can enable long-context capabilitie
  -d '{
  "model": "Qwen2-72B-Instruct",
  "messages": [
- {"role": "system", "content": "You are a helpful assistant."},
- {"role": "user", "content": "Your Long Input Here."}
  ]
  }'
  ```

  For further usage instructions of vLLM, please refer to our [Github](https://github.com/QwenLM/Qwen2).

- **Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

  ## Citation
 
  1. **Install vLLM**: Ensure you have the latest version from the main branch of [vLLM](https://github.com/vllm-project/vllm).

  2. **Configure Model Settings**: After downloading the model weights, modify the `config.json` file by including the below snippet:
+ ```json
  {
+ "architectures": [
+ "Qwen2ForCausalLM"
+ ],
+ // ...
+ "vocab_size": 152064,
+
+ // adding the following snippets
+ "rope_scaling": {
+ "factor": 4.0,
+ "original_max_position_embeddings": 32768,
+ "type": "yarn"
+ }
  }
+ ```
  This snippet enable YARN to support longer contexts.

  3. **Model Deployment**: Utilize vLLM to deploy your model. For instance, you can set up an openAI-like server using the command:
 
  -d '{
  "model": "Qwen2-72B-Instruct",
  "messages": [
+ {"role": "system", "content": "You are a helpful assistant."},
+ {"role": "user", "content": "Your Long Input Here."}
  ]
  }'
  ```

  For further usage instructions of vLLM, please refer to our [Github](https://github.com/QwenLM/Qwen2).

+ **Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

  ## Citation
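
Reviewer sketch (not part of this commit): the `rope_scaling` edit that step 2 of the README describes could also be applied with a short script instead of by hand. The example below is a minimal sketch, assuming the downloaded `config.json` is plain JSON without the `// ...` comments shown in the README snippet. The function names `enable_yarn` and `disable_yarn` are hypothetical and the file path is supplied by the caller; only the three `rope_scaling` values come from the README.

```python
import json
from pathlib import Path

# Values copied from the README snippet in this commit.
YARN_ROPE_SCALING = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}


def enable_yarn(config_path):
    """Add the rope_scaling block to config.json and return the updated config.

    Hypothetical helper: it only edits the JSON file on disk and does not
    call vLLM or any other library.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["rope_scaling"] = dict(YARN_ROPE_SCALING)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


def disable_yarn(config_path):
    """Remove the rope_scaling block again, leaving all other keys untouched."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config.pop("rope_scaling", None)  # no error if the key is absent
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Keeping a matching `disable_yarn` follows the README note: because the scaling factor is static, the block is worth adding only when long contexts are actually needed, so it helps to be able to switch it off again.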