Issue with Unexpected Tokens and Long Responses in Model Outputs

#13
by UlanYisaev - opened

Hello,

I've been experiencing an issue with the openchat-3.5-0106 model where it adds unexpected tokens after completing its responses. Additionally, if the max_tokens parameter isn't set to limit the output, the model continues to generate a lengthy response filled with meaningless tokens. Here’s an example of the issue:

Prompt: Ist FSME tödlich? (German: "Is TBE fatal?")
Response: FSME kann tödlich werden, wenn nicht rechtzeitig und richtig behandelt wird. In den Dokumenten gibt es keine genaue Angabe über die Todesrate, aber es wird darauf hingewiesen, dass eine Infektion mit tödlichen Folgen einhergehen kann.GPT4G.GPT4Ihre Anfrage kann nicht mit den bereitgestellten Daten beantwortet werden.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G.G. [Here, the output continues with repeated, nonsensical tokens like "GPT4G.GPT4..." repeating many times.]

Interestingly, I tested https://huggingface.co/Nexusflow/Starling-LM-7B-beta and did not encounter these issues: the response was clean, with no odd token repetitions. This leads me to believe there may be an underlying issue specific to the openchat-3.5-0106 model, or possibly to my setup.

My setup is a VM in Azure; I tried both vLLM and aphrodite-engine with the same results.

Logs:
INFO: Received request cmpl-907a5797f8c748d095d1812aebb5ad87: prompt: "<s>GPT4 Correct System: Please generate an ANSWER that strictly adheres to the given CONTEXT and accurately addresses the QUESTION asked, without adding any of the model's own information. If the required
information is not found in the CONTEXT, respond in German with: 'Ihre Anfrage kann nicht mit den bereitgestellten Daten beantwortet werden. Bitte erläutern Sie Ihre Anfrage genauer oder geben Sie weitere Informationen an, falls notwendig'. Avoid references to previous outputs of
the model. The answer should be based solely on the provided CONTEXT. If the QUESTION does not directly relate to a health-related topic or is not clearly answerable, briefly explain why the query cannot be answered and recommend a more precise formulation or additional
information.\n\nCONTEXT:\n[Doc Nr. 1]...nRoesebeckstr.<|end_of_turn|>GPT4 Correct User: Ist FSME tödlich?<|end_of_turn|>GPT4 Correct Assistant:", sampling params: SamplingParams(temperature=0.7, max_tokens=6510), prompt token ids: [1, 1, 420,
6316, 28781, 3198, 3123, 2135, 28747, 5919, 8270, 396, 2976, 9854, 725, 369, 19470, 616... 28747], lora_request: None.
INFO: Avg prompt throughput: 181.0 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.6%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 88.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.0%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 87.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.4%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 86.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.8%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 84.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.2%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 84.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.5%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 83.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.8%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 82.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.8%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 82.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.8%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 82.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.8%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 82.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.8%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 82.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.8%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 82.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.8%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 81.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.8%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 81.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.8%, CPU KV cache usage: 0.0%
INFO: Finished request cmpl-907a5797f8c748d095d1812aebb5ad87.
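As a stopgap while debugging, I've been cleaning the output client-side. This is only a sketch under the assumption that the server isn't treating `<|end_of_turn|>` (or the leaked `GPT4...` template fragments) as stop sequences; `truncate_at_stop` and `STOP_MARKERS` are hypothetical names, not part of vLLM's API:

```python
# Client-side guard: cut the completion at the first known stop marker,
# in case the serving stack does not stop on <|end_of_turn|> for this model.
STOP_MARKERS = ["<|end_of_turn|>", "GPT4 Correct User:", "GPT4"]

def truncate_at_stop(text: str, markers: list[str] = STOP_MARKERS) -> str:
    """Return `text` cut off at the earliest occurrence of any marker."""
    cut = len(text)
    for marker in markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)  # keep only text before the earliest marker
    return text[:cut].rstrip()

# Example resembling the broken output above:
raw = ("FSME kann tödlich werden, wenn nicht rechtzeitig behandelt wird."
       "GPT4G.GPT4Ihre Anfrage kann nicht beantwortet werden.G.G.G.")
print(truncate_at_stop(raw))
```

A cleaner fix, if the template tokens really aren't registered as EOS, would presumably be to pass them as stop sequences server-side (vLLM's `SamplingParams` accepts a `stop` list, and the OpenAI-compatible endpoint accepts `stop` in the request body), but that only masks the question of why this model emits them at all.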
