tencent/HunyuanVideo-PromptRewrite · Are the tokenizer config, chat template, and system prompt correct? Clarification requested

Hi there - I've tried every way I can think of to use this model, and I keep getting the kind of output that, in my past experience, comes from an incorrect prompt template, a wrong system prompt, wrong special tokens, a stray newline, or all of the above. :)

Can you confirm whether the tokenizer config JSON in the model repo is accurate for the PromptRewrite model?

More specifically, I'm asking about the chat_template and the special tokens that should be used.

In the HunyuanVideo repo, there's a hyvideo/prompt_rewrite.py file that defines a master mode prompt and a normal mode prompt. Do we need to include one of these in the system role prompt, the user prompt, or both?

i.e., should inference prompts look like this:

a red panda swimming in the ocean as the camera zooms out to reveal a vast ocean of blue reflections

or like this:

Master mode - Video Recaption Task:

You are a large language model specialized in rewriting video descriptions. Your task is to modify the input description.

0. Preserve ALL information, including style words and technical terms.

1. If the input is in Chinese, translate the entire description to English. 

2. If the input is just one or two words describing an object or person, provide a brief, simple description focusing on basic visual characteristics. Limit the description to 1-2 short sentences.

3. If the input does not include style, lighting, atmosphere, you can make reasonable associations.

4. Output ALL must be in English.

Given Input:
input: "a red panda swimming in the ocean as the camera zooms out to reveal a vast ocean of blue reflections"

or neither?
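To make the two options concrete, here is a minimal Python sketch of what I mean. The names MASTER_MODE_TEMPLATE and wrap_prompt are hypothetical and are my own, not taken from the HunyuanVideo repo; the template text is simply the master mode prompt quoted above, with the input left as a placeholder.

```python
# Hypothetical helper for illustration only; not code from the HunyuanVideo repo.
# The template text is copied from the master mode prompt quoted above.
MASTER_MODE_TEMPLATE = """Master mode - Video Recaption Task:

You are a large language model specialized in rewriting video descriptions. Your task is to modify the input description.

0. Preserve ALL information, including style words and technical terms.

1. If the input is in Chinese, translate the entire description to English.

2. If the input is just one or two words describing an object or person, provide a brief, simple description focusing on basic visual characteristics. Limit the description to 1-2 short sentences.

3. If the input does not include style, lighting, atmosphere, you can make reasonable associations.

4. Output ALL must be in English.

Given Input:
input: "{input}"
"""


def wrap_prompt(raw_prompt: str, template: str = MASTER_MODE_TEMPLATE) -> str:
    """Insert the raw user prompt into the rewrite-task template."""
    return template.format(input=raw_prompt)


raw = "a red panda swimming in the ocean as the camera zooms out to reveal a vast ocean of blue reflections"

option_a = raw               # first option: send the bare prompt
option_b = wrap_prompt(raw)  # second option: send the prompt wrapped in the task instructions
```

What I can't tell is whether the model expects option_a, option_b, or something else, and in which role (system or user) the instructions belong.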

Looking solely at the chat_template in the tokenizer config (and the custom code Python files), my best guess is something like this, but I would love feedback on what is missing or wrong:

<|startoftext|>This is where the system role prompt would usually go<|extra_4|><|startoftext|>a red panda swimming in the ocean as the camera zooms out to reveal a vast ocean of blue reflections<|extra_0|>

and the response would be something like:

<|startoftext|>A red panda gracefully swims through the ocean's depths. As the camera pulls back, the vastness of the ocean is revealed, its surface shimmering with countless blue reflections.<|eos|>
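Spelled out as code, my guess at the raw format looks like the sketch below. The token strings are copied from my guess above, so they may well be wrong (that is the question), and build_raw_prompt and strip_response are hypothetical helper names of my own, not part of the model's custom code.

```python
# Token strings copied from my guess above; whether they are correct is the open question.
BOS = "<|startoftext|>"
SYSTEM_END = "<|extra_4|>"
USER_END = "<|extra_0|>"
EOS = "<|eos|>"


def build_raw_prompt(user: str, system: str = "") -> str:
    """Assemble the raw string I assume the chat template produces."""
    return f"{BOS}{system}{SYSTEM_END}{BOS}{user}{USER_END}"


def strip_response(text: str) -> str:
    """Remove the leading BOS and trailing EOS markers from a decoded response."""
    if text.startswith(BOS):
        text = text[len(BOS):]
    if text.endswith(EOS):
        text = text[: -len(EOS)]
    return text.strip()
```

If the tokenizer loads correctly, I'd expect calling tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) from transformers to show what the shipped template actually renders, which could then be compared against the output of build_raw_prompt.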

Any help would be hugely appreciated! 🙏