Tool calls broken; Link to fix

#2
by apresence - opened

FYI, tool call tokens are broken for the current version of the models in this repo. The tool call tokens are hidden from the output due to being marked as special tokens.

Example:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>system name=<|plugin|>
[{"name": "generate_image", "description": "Generates an image based on the given text prompt", "parameters": {"type": "object", "properties": {"prompt": {"type": "string", "description": "The text prompt used to guide image generation"}}, "required": ["prompt"]}}]<|im_end|>
<|im_start|>user
Draw a picture of a kitten.<|im_end|>
<|im_start|>assistant
I will call an image generation api to generate image
{"name": "generate_image", "parameters": {"prompt": "A cute and playful kitten with big, round eyes, sitting on a fluffy pillow, in a soft, pastel color palette, impressionism style, high resolution, with a warm, cozy atmosphere."}}

Notice the tokens are missing. This is the expected output:

<|im_start|>assistant
I will call an image generation api to generate image<|action_start|><|plugin|>
{"name": "generate_image", "parameters": {"prompt": "A cute and playful kitten with big, round eyes, sitting on a fluffy pillow, in a soft, pastel color palette, impressionism style, high resolution, with a warm, cozy atmosphere."}}<|action_end|>
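To illustrate why the missing tokens matter to a client, here is a minimal sketch of a parser that pulls the tool call out of the assistant text. The function name `extract_tool_call` is hypothetical; the marker strings are taken from the expected output above.

```python
import json
from typing import Optional

# Marker strings as they appear in the expected output above.
ACTION_START = "<|action_start|><|plugin|>"
ACTION_END = "<|action_end|>"


def extract_tool_call(text: str) -> Optional[dict]:
    """Return the JSON tool call found between the action markers, or None."""
    start = text.find(ACTION_START)
    if start == -1:
        # Markers were stripped, so the call cannot be located reliably.
        return None
    start += len(ACTION_START)
    end = text.find(ACTION_END, start)
    if end == -1:
        return None
    return json.loads(text[start:end].strip())
```

With the markers present, the JSON payload can be located unambiguously; without them, a client has to guess where the free text ends and the call begins.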

I have posted corrected versions here.

I've fixed the transformers version of the model as well and posted it here.

Thanks!

InternLM org

@apresence hi, thank you for the feedback. We'll try to fix it later.

You're welcome to look at the code and settings changes from the repo I stood up.

Let me know if I can help!

Also, if I may, I'd like to share another issue with you. It seems that your model uses dynamic RoPE scaling by default. It also supports linear scaling, but when I tried that, the perplexity was quite bad. I notice that when running the model in llama.cpp, which does not support dynamic scaling and thus has to use linear, the perplexity is worse than with dynamic scaling in transformers. That is, it starts out great, but after a few back-and-forth interactions it starts to degrade, eventually repeating itself and forgetting information. This happens while the prompt is still well within the configured context length. I'd love to hear your team's insights on the issue, and any ideas about how to address it.
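One way to reason about the difference is a small sketch of the two schemes, assuming the formulas commonly used for linear and dynamic NTK scaling in transformers-style rotary embeddings. The function names and the numeric values in the usage note are illustrative assumptions, not the model's actual configuration.

```python
def linear_position(position: int, factor: float) -> float:
    """Linear scaling: every position index is divided by the factor."""
    return position / factor


def dynamic_ntk_base(base: float, dim: int, seq_len: int,
                     max_pos: int, factor: float) -> float:
    """Dynamic NTK scaling: the rotary base grows only once seq_len exceeds max_pos."""
    if seq_len <= max_pos:
        return base  # short sequences use the unmodified base
    scale = (factor * seq_len / max_pos) - (factor - 1)
    return base * scale ** (dim / (dim - 2))
```

Under these assumptions, `dynamic_ntk_base(10000.0, 128, 1000, 4096, 2.0)` leaves the base untouched because the sequence is short, while `linear_position(1000, 2.0)` already compresses the position. That may be one reason the two modes behave differently even when the prompt is well within the context length.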

Thanks for the great model!

apresence changed discussion status to closed

@apresence hi,

1. special tokens missing issue

After verification, there seems to be a misunderstanding about the statement "The tool call tokens are hidden from the output due to being marked as special tokens."
Special tokens are not shown by default. If you add --special to llama-cli, you can see the full output string, including special tokens. You can also check the log file if you add --logdir.

command

build/bin/llama-cli \
    --model internlm2_5-7b-chat-fp16.gguf \
    --predict 512 \
    --ctx-size 4096 \
    --gpu-layers 32 \
    --temp 0.8 \
    --top-p 0.8 \
    --top-k 50 \
    --seed 1024 \
    --color \
    --prompt '<|im_start|>system\nYou are a harmless AI assistant.<|im_end|>\n<|im_start|>system name=<|plugin|>[{"name": "generate_image", "description": "Generates an image based on the given text prompt", "parameters": {"type": "object", "properties": {"prompt": {"type": "string", "description": "The text prompt used to guide image generation"}}, "required": ["prompt"]}}]<|im_end|>\n' \
    --interactive \
    --multiline-input \
    --conversation \
    --verbose \
    --logdir ./logdir \
    --in-suffix "<|im_end|>\n<|im_start|>assistant\n" \
    --special

Here is the resulting conversation:

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - To return control to the AI, end your input with '\'.
 - To return control without starting a new line, end your input with '/'.

<s><|im_start|>system
You are a harmless AI assistant.<|im_end|>
<|im_start|>system name=<|plugin|>[{"name": "generate_image", "description": "Generates an image based on the given text prompt", "parameters": {"type": "object", "properties": {"prompt": {"type": "string", "description": "The text prompt used to guide image generation"}}, "required": ["prompt"]}}]<|im_end|>

> <|im_start|>user\nDraw a picture of a kitten.
I will call an image generation api to generate image<|action_start|><|plugin|>[{"name": "generate_image", "parameters": {"prompt": "A cute, fluffy kitten with big round eyes, sitting on a soft cushion, warm and cozy, pastel colors, impressionism, high resolution, captured on a DSLR camera, natural lighting, detailed fur texture."}}]<|im_end|>

As you can see, the special tokens are not missing.

2. llama.cpp does not support dynamic rope scaling

This is true, and there is an open issue: https://github.com/ggerganov/llama.cpp/issues/8361. No response so far.

Thank you for taking the time to address this topic.

You are right, llama-cli does show the tokens when the --special flag is used. However, I originally discovered the issue with the /completion endpoint of llama-server. I just happened to use llama-cli to demonstrate it because that made it easy to provide output that others could follow and verify on their own. As an interesting note, unlike the HF generate() function, I don't see an option for llama-server to hide/unhide special tokens, either as a command line option (I show below that --special is ignored) or as a JSON argument in the API call itself. The only way I am aware of to change the behavior is to modify the GGUF metadata. That is exactly what I did, and it is the reason I posted models with those changes applied.

Let's remove llama-cli from the equation. To that end, I've written and used a little test program to call the /completion endpoint and demonstrate the issue. Below are clips of the output for different scenarios. I can provide the script and command line parameters upon request.
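A test of this kind can be sketched roughly as follows. This is not the script used for the results below, only a minimal illustration: it assumes llama-server's /completion endpoint accepts a JSON body with `prompt` and `n_predict` and returns the text under `content`, and the helper names, default URL, and port are hypothetical.

```python
import json
import re
import urllib.request


def complete(prompt: str, base_url: str = "http://127.0.0.1:8080",
             n_predict: int = 256) -> str:
    """POST the raw prompt to /completion and return the generated text."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["content"]


def has_tool_call(text: str, function_name: str) -> bool:
    """Check that the response wraps a call to function_name in the action tokens."""
    pattern = (
        re.escape("<|action_start|><|plugin|>")  # escape the literal pipes
        + r'\s*\{.*"name":\s*"' + re.escape(function_name) + r'".*\}\s*'
        + re.escape("<|action_end|>")
    )
    return re.search(pattern, text, flags=re.DOTALL) is not None
```

A run would call `complete(...)` with the full prompt and pass the result to `has_tool_call(...)`. Escaping the marker strings with `re.escape` avoids the pipes being read as regex alternation.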

For the record, this is the version of llama-server I used for these tests:

$ ./llama-server --version
version: 3368 (dd07a123)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Without tool call fix

The tool call tokens are never included regardless of the --special option.

With --special

[ SYS ] === TEST MODEL: internlm.internlm2_5-7b-chat-q4_k_m.gguf ===
[ SYS ] Args: ['./llama-server', '--model', 'internlm.internlm2_5-7b-chat-q4_k_m.gguf', '--host', '127.0.0.1', '--port', '52756', '--gpu_layers', '32', '--split_mode', 'none', '--special']
[ <<< ] '<|im_start|>system\nYou are InternLM2-Chat, a harmless AI assistant<|im_end|>\n<|im_start|>system name=<|plugin|>\n[\n{\n"name": "get_current_weather",\n"description": "Get the current weather in a given location",\n"parameters": {\n"type": "object",\n"properties": {\n"location": {\n"type": "string",\n"description": "The city and state, e.g. San Francisco, CA",\n},\n"unit": {"type": "string"},\n},\n"required": ["location"],\n},\n}\n]\n<|im_end|>\n<|im_start|>user\nI want to know today\'s weather in Shanghai<|im_end|>\n<|im_start|>assistant\n'
[ SYS ] Expected response pattern: '^.*<\\|action_start\\|><|plugin|>\\n{"name": "get_current_weather", "parameters": {"location": "Shanghai"}}<\\|action_end\\|><\\|im_end\\|>$'
[ >>> ] 'I need to use the get_current_weather function to get the current weather in Shanghai.\n{"name": "get_current_weather", "parameters": {"location": "Shanghai"}}\n'
[ SYS ] Overall result for 'internlm.internlm2_5-7b-chat-q4_k_m.gguf': FAIL
[ SYS ] Reason for result: Response does not match expected pattern

Without --special

[ SYS ] === TEST MODEL: internlm.internlm2_5-7b-chat-q4_k_m.gguf ===
[ SYS ] Args: ['./llama-server', '--model', 'internlm.internlm2_5-7b-chat-q4_k_m.gguf', '--host', '127.0.0.1', '--port', '52756', '--gpu_layers', '32', '--split_mode', 'none']
[ <<< ] '<|im_start|>system\nYou are InternLM2-Chat, a harmless AI assistant<|im_end|>\n<|im_start|>system name=<|plugin|>\n[\n{\n"name": "get_current_weather",\n"description": "Get the current weather in a given location",\n"parameters": {\n"type": "object",\n"properties": {\n"location": {\n"type": "string",\n"description": "The city and state, e.g. San Francisco, CA",\n},\n"unit": {"type": "string"},\n},\n"required": ["location"],\n},\n}\n]\n<|im_end|>\n<|im_start|>user\nI want to know today\'s weather in Shanghai<|im_end|>\n<|im_start|>assistant\n'
[ SYS ] Expected response pattern: '^.*<\\|action_start\\|><|plugin|>\\n{"name": "get_current_weather", "parameters": {"location": "Shanghai"}}<\\|action_end\\|><\\|im_end\\|>$'
[ >>> ] 'To fulfill your request, I need to use the \\"get_current_weather\\" function and provide the location parameter as \\"Shanghai\\". I will also specify the unit of measurement as \\"metric\\" to ensure accuracy.\n{"name": "get_current_weather", "parameters": {"location": "Shanghai", "unit": "metric"}}\n'
[ SYS ] Overall result for 'internlm.internlm2_5-7b-chat-q4_k_m.gguf': FAIL
[ SYS ] Reason for result: Response does not match expected pattern

With tool call fix

The tool call tokens are always included regardless of the --special option.

With --special

[ SYS ] === TEST MODEL: apresence.internlm2_5-7b-chat-Q4_K_M.gguf ===
[ SYS ] Args: ['./llama-server', '--model', 'apresence.internlm2_5-7b-chat-Q4_K_M.gguf', '--host', '127.0.0.1', '--port', '52756', '--gpu_layers', '32', '--split_mode', 'none', '--special']
[ <<< ] '<|im_start|>system\nYou are InternLM2-Chat, a harmless AI assistant<|im_end|>\n<|im_start|>system name=<|plugin|>\n[\n{\n"name": "get_current_weather",\n"description": "Get the current weather in a given location",\n"parameters": {\n"type": "object",\n"properties": {\n"location": {\n"type": "string",\n"description": "The city and state, e.g. San Francisco, CA",\n},\n"unit": {"type": "string"},\n},\n"required": ["location"],\n},\n}\n]\n<|im_end|>\n<|im_start|>user\nI want to know today\'s weather in Shanghai<|im_end|>\n<|im_start|>assistant\n'
[ SYS ] Expected response pattern: '^.*<\\|action_start\\|><|plugin|>\\n{"name": "get_current_weather", "parameters": {"location": "Shanghai"}}<\\|action_end\\|><\\|im_end\\|>$'
[ >>> ] 'I need to use the get_current_weather function to get the current weather in Shanghai.<|action_start|><|plugin|>\n{"name": "get_current_weather", "parameters": {"location": "Shanghai", "unit": "metric"}}<|action_end|>\n'
[ SYS ] Test Result: PASS
[ <<< ] '<|im_start|>environment name=<|plugin|>\n{"temperature": 22}<|im_end|>\n<|im_start|>assistant\n'
[ SYS ] Expected response pattern: '^.*\\b22\\b.*$'
[ >>> ] 'The temperature is currently at 22 degrees Celsius.'
[ SYS ] Test Result: PASS
[ SYS ] Overall result for 'apresence.internlm2_5-7b-chat-Q4_K_M.gguf': PASS

Without --special

[ SYS ] === TEST MODEL: apresence.internlm2_5-7b-chat-Q4_K_M.gguf ===
[ SYS ] Args: ['./llama-server', '--model', 'apresence.internlm2_5-7b-chat-Q4_K_M.gguf', '--host', '127.0.0.1', '--port', '52756', '--gpu_layers', '32', '--split_mode', 'none']
[ <<< ] '<|im_start|>system\nYou are InternLM2-Chat, a harmless AI assistant<|im_end|>\n<|im_start|>system name=<|plugin|>\n[\n{\n"name": "get_current_weather",\n"description": "Get the current weather in a given location",\n"parameters": {\n"type": "object",\n"properties": {\n"location": {\n"type": "string",\n"description": "The city and state, e.g. San Francisco, CA",\n},\n"unit": {"type": "string"},\n},\n"required": ["location"],\n},\n}\n]\n<|im_end|>\n<|im_start|>user\nI want to know today\'s weather in Shanghai<|im_end|>\n<|im_start|>assistant\n'
[ SYS ] Expected response pattern: '^.*<\\|action_start\\|><|plugin|>\\n{"name": "get_current_weather", "parameters": {"location": "Shanghai"}}<\\|action_end\\|><\\|im_end\\|>$'
[ >>> ] 'I need to use the get_current_weather function to get the current weather in Shanghai.<|action_start|><|plugin|>\n{"name": "get_current_weather", "parameters": {"location": "Shanghai"}}<|action_end|>\n'
[ SYS ] Test Result: PASS
[ <<< ] '<|im_start|>environment name=<|plugin|>\n{"temperature": 22}<|im_end|>\n<|im_start|>assistant\n'
[ SYS ] Expected response pattern: '^.*\\b22\\b.*$'
[ >>> ] "It seems you're interested in the temperature, which is currently at 22 degrees Celsius. How can I assist you further today? Is there a specific task or information you need?"
[ SYS ] Test Result: PASS
[ SYS ] Overall result for 'apresence.internlm2_5-7b-chat-Q4_K_M.gguf': PASS
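The prompts in the logs above follow a fixed layout, which can be sketched with two small helpers. The function names are hypothetical, and the tool list is serialized compactly with `json.dumps` rather than in the multi-line form shown in the logs.

```python
import json


def build_prompt(system: str, tools: list, user: str) -> str:
    """Assemble the system, plugin, and user turns, ending with an open assistant turn."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>system name=<|plugin|>\n{json.dumps(tools)}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


def build_tool_result(result: dict) -> str:
    """Wrap a tool result in an environment turn and reopen the assistant turn."""
    return (
        f"<|im_start|>environment name=<|plugin|>\n{json.dumps(result)}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```

For example, `build_tool_result({"temperature": 22})` reproduces the follow-up prompt sent in the second step of the passing tests.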

Just after I took the time to test and write all that up, I received a notification of an update for llama.cpp.

There is a fix planned: #8506.

Once the fix is published and verified to be working, I can take down the copies I set up on hf.

I'm really impressed how the community comes together for these things. Thanks everyone!

@apresence hi, thanks for your detailed info. llama-server with --special is fixed in this PR: https://github.com/ggerganov/llama.cpp/pull/8553

I've tested it, and it works with --system-prompt-file instead of the --prompt argument. It may be worth opening an issue in llama.cpp about how to use llama-server correctly.

  1. create sys-prompt.txt
printf '%b' '<|im_start|>system\nYou are InternLM2-Chat, a harmless AI assistant.<|im_end|>\n<|im_start|>system name=<|plugin|>[{"name": "get_current_weather", "parameters": {"required": ["location"], "type": "object", "properties": {"location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"}, "unit": {"type": "string"}}}, "description": "Get the current weather in a given location"}]<|im_end|>\n' > sys-prompt.txt
  2. start server
CUDA_VISIBLE_DEVICES=2 build/bin/llama-server \
    --model internlm2_5-7b-chat-fp16.gguf \
    --predict 512 \
    --ctx-size 4096 \
    --gpu-layers 32 \
    --temp 0.8 \
    --top-p 0.8 \
    --top-k 50 \
    --seed 1024 \
    --color \
    --system-prompt-file sys-prompt.txt \
    --interactive \
    --multiline-input \
    --conversation \
    --in-suffix "<|im_end|>\n<|im_start|>assistant\n" \
    --special
  3. call service
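from openai import OpenAI

# Point the client at the local llama-server.
client = OpenAI(
    api_key='YOUR_API_KEY',
    base_url='http://localhost:8080/v1'
)

# Same tool definition as in sys-prompt.txt.
tools = [{
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
            "unit": {"type": "string"},
        },
        "required": ["location"],
    },
}]

messages = [{"role": "user", "content": "<|im_start|>user\nI want to know today's weather in Shanghai"}]

model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    functions=tools,
    temperature=0.8,
    top_p=0.8
)
print(response.choices[0].message.content)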

results

I need to use the get_current_weather API to get the weather in Shanghai.<|action_start|><|plugin|>
{"name": "get_current_weather", "parameters": {"location": "Shanghai"}}<|action_end|>
<|im_end|>
