Response Repeating Issue

#3
by Jinmyoung - opened

Hello, I am encountering an issue where the LLM's responses do not finish properly and instead repeat sentences.
This does not happen with every response but occurs approximately once every two or three times.
After the reasoning process completes, the model outputs words like "답변" ("answer") and "최종답변" ("final answer"), which then start repeating.

I observed the same issue when using the GGUF-converted model with both llama.cpp and vLLM (a rough sketch of the vLLM invocation is included after the code below).
Even when I ran the example you provided, the problem persisted.

import time
import torch
import gc
import random
from transformers import AutoModelForCausalLM, AutoTokenizer
from data import test_user_contents

# Load the 8B model and its matching tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    "UNIVA-Bllossom/DeepSeek-llama3.1-Bllossom-8B",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("UNIVA-Bllossom/DeepSeek-llama3.1-Bllossom-8B")

system='''
You are a highly capable assistant. For every user question, follow these instructions exactly:
    1.	First, think through the problem step-by-step in English. Enclose all of your internal reasoning between <think> and </think> tags. This chain-of-thought should detail your reasoning process.
    2.	After the closing </think> tag, provide your final answer.
    3.	Do not include any additional text or commentary outside of this format.
    4.	Your output should strictly follow this structure:

<think>
[Your detailed step-by-step reasoning in English]
</think>
[Your final answer]
'''


for i in range(10):
    start_time = time.time() 
    user_content = random.choice(test_user_contents)
    chat = [
        {"role": "system", "content": system},
        {"role": "user", "content": user_content}
    ]

    prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    
    model_inputs = tokenizer(
        prompt,
        return_tensors="pt",
        add_special_tokens=True
    )

    if "token_type_ids" in model_inputs:
        del model_inputs["token_type_ids"]

    model_inputs = {k: v.to(model.device) for k, v in model_inputs.items()}

    # No explicit sampling arguments; generation falls back to the model's default generation_config.
    with torch.no_grad():
        generated_ids = model.generate(
            **model_inputs,
            max_new_tokens=4096,
        )

    # Decodes the full sequence, prompt tokens included.
    response_text = tokenizer.decode(generated_ids[0].cpu(), skip_special_tokens=True)

    print(response_text)
    

    del model_inputs, generated_ids
    torch.cuda.empty_cache()  
    gc.collect()  
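
For completeness, the vLLM run was roughly along the following lines. This is a minimal sketch: the model ID, prompt, and sampling values are illustrative, and in my actual test I pointed vLLM at the GGUF conversion rather than the Hugging Face repo.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Illustrative setup: my actual run loaded the GGUF conversion instead of the HF repo ID.
llm = LLM(model="UNIVA-Bllossom/DeepSeek-llama3.1-Bllossom-8B")
tokenizer = AutoTokenizer.from_pretrained("UNIVA-Bllossom/DeepSeek-llama3.1-Bllossom-8B")

system = "You are a highly capable assistant. ..."  # abbreviated; same system prompt as above
chat = [
    {"role": "system", "content": system},
    {"role": "user", "content": "Explain the Pythagorean theorem."},  # placeholder question
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# Sampling values are placeholders, not the exact settings from my runs.
sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)

outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)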

I also tried clearing the CUDA cache and adding a repetition penalty, but neither resolved the issue. A sketch of the repetition-penalty attempt is shown below.
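
For clarity, the repetition-penalty attempt only swapped out the generate call inside the loop above; the penalty value here is illustrative, not the exact setting I tested.

    with torch.no_grad():
        generated_ids = model.generate(
            **model_inputs,
            max_new_tokens=4096,
            repetition_penalty=1.2,  # illustrative value, not the exact one I used
        )
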
Could you provide any insights into what might be causing this problem?
Thank you!
