Model context length is not 16k

#1 by sanjeev-bhandari01 - opened

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Yukang/Llama-2-7b-longlora-16k-ft"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

device_map = "auto"

# model arguments
model_kwargs = dict(
    torch_dtype="auto",
    use_cache=False,
    device_map=device_map,
    quantization_config=quantization_config,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)

... code for inference, with a prompt roughly 6,200 tokens long
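
The inference code itself is not shown above; below is a minimal sketch of what a plain generate() call could look like here, assuming the prompt string is already built (the prompt variable, max_new_tokens=256, and the decode step are illustrative choices, not the original code):

# hypothetical prompt of roughly 6,200 tokens
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(inputs["input_ids"].shape[-1])  # check the actual prompt length in tokens

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)

# decode only the newly generated tokens
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))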

Output

This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
.1. 1. 1. 1.
<. 1.
. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. . 1. . 1. 1. . .
. 1. . 1. 1. 1. .. .. . . . . . . .
[... the rest of the generation continues as repeated dots, stray digits, and isolated fragments such as "while" ...]
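
The warning suggests that generate() is reading a maximum length of 4096 from the checkpoint's config rather than 16k. Here is a minimal sketch for checking what the uploaded config actually reports, and for overriding it at load time if needed; the 16384 context length and the linear RoPE scaling factor of 4.0 below are assumptions based on the model's "16k" name, not confirmed values for this checkpoint:

from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_id)
print(config.max_position_embeddings)          # the limit generate() warns about
print(getattr(config, "rope_scaling", None))   # RoPE scaling settings, if any

# If the config really does report 4096, its values can be overridden when loading.
# 16384 and {"type": "linear", "factor": 4.0} are assumptions, not verified settings.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    max_position_embeddings=16384,
    rope_scaling={"type": "linear", "factor": 4.0},
    **model_kwargs,
)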
