Is it possible to make the model return only the response, without the prompt?

#61
by mduran159

With the example code you posted I can only get it to return the entire prompt with the model response appended at the end, as is usual with these models. But when you use a pipeline you can avoid all of that, so is there a way to make it work like an LLM with a pipeline and return only the model's response/answer?
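
For reference, this is the pipeline behaviour I mean: the text-generation pipeline has a return_full_text flag that drops the prompt. A minimal sketch (the model name here is just a placeholder):

    from transformers import pipeline

    # placeholder model name; any text-generation checkpoint works
    pipe = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

    # return_full_text=False makes the pipeline return only the newly
    # generated text, without echoing the prompt
    result = pipe("What is the capital of France?", max_new_tokens=50, return_full_text=False)
    print(result[0]["generated_text"])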

I'm not sure what's conventional, as this is the most I've used transformers, but you can always strip it based on the special tokens, right?

    # Llama 3 style markers that delimit the assistant turn
    start_token = "<|start_header_id|>assistant<|end_header_id|>"
    end_token = "<|eot_id|>"

    marker_index = processor_output.find(start_token)
    end_index = processor_output.rfind(end_token)

    # find() returns -1 on a miss, so check before offsetting past the marker
    if marker_index != -1 and end_index != -1:
        start_index = marker_index + len(start_token)
        if start_index < end_index:
            content = processor_output[start_index:end_index].strip()
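
For completeness, processor_output above is assumed to be the full decoded string from a generate call, with special tokens kept so the markers are present:

    # sketch; model, processor, and inputs are assumed from the earlier example
    output_ids = model.generate(**inputs, max_new_tokens=256)
    processor_output = processor.decode(output_ids[0], skip_special_tokens=False)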

I know skip_special_tokens can be set while decoding, but the special tokens seem to provide good structure.
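
If you'd rather avoid string matching altogether, you can also slice off the prompt tokens before decoding. A sketch, assuming the same inputs and output_ids as above:

    # keep only the tokens generated after the prompt
    prompt_length = inputs["input_ids"].shape[-1]
    response = processor.decode(output_ids[0][prompt_length:], skip_special_tokens=True)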
