perSLIMmon-8b-base

persimmon-8b went to the vocab lipo clinic

A slimmed-down version of persimmon-8b-base that removes the ~70,000 unused entries from the model vocabulary and tokenizer (see the safetensors layer overview). It should be slightly faster than the original.

Credit: fine-tune-fuyu (scripts/surgery.py was adapted for persimmon)
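
For intuition, here is a toy sketch of what trimming a vocabulary involves: keep only the embedding rows for token ids that are actually used, and remap old ids to new contiguous ones. This is not the code from scripts/surgery.py. The helper names `trim_vocab` and `embedding_params_saved` are hypothetical, and the hidden size of 4096 and the two untied matrices (input embeddings and output head) used in the estimate are assumptions that are not stated in this card.

```python
def trim_vocab(embeddings, used_ids):
    """Keep only the rows of `embeddings` whose token id is in `used_ids`.

    Returns the trimmed matrix and a mapping from old token id to new token id.
    `embeddings` is a plain list of rows here; a real model would use tensors.
    """
    kept = sorted(set(used_ids))
    old_to_new = {old: new for new, old in enumerate(kept)}
    trimmed = [embeddings[old] for old in kept]
    return trimmed, old_to_new


def embedding_params_saved(removed_entries, hidden_size, n_matrices=2):
    """Rough count of parameters dropped when vocabulary rows are removed.

    n_matrices=2 assumes separate (untied) input and output embeddings.
    """
    return removed_entries * hidden_size * n_matrices


# Toy example: a 4-token vocabulary where only ids 1 and 3 are used.
toy_embeddings = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
trimmed, old_to_new = trim_vocab(toy_embeddings, used_ids=[3, 1])

# Rough estimate for ~70,000 removed entries, ASSUMING hidden_size=4096.
saved = embedding_params_saved(70_000, hidden_size=4096)
```

Under those assumptions the estimate comes to roughly 0.57B parameters, all of them in the embedding and output layers; the transformer blocks themselves are untouched.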

inference

install required pkgs:

pip install -U transformers accelerate bitsandbytes sentencepiece

load in 4bit & run inference:

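from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

tokenizer = AutoTokenizer.from_pretrained("pszemraj/perSLIMmon-8b-base")
model = AutoModelForCausalLM.from_pretrained(
    "pszemraj/perSLIMmon-8b-base",
    # 4-bit quantization via bitsandbytes; GPU required.
    # Passing load_in_4bit=True directly is deprecated in favor of quantization_config.
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype="auto",
    device_map="auto",
)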
inputs = tokenizer("The weather is always wonderful", return_tensors="pt").to(
    model.device
)
tokens = model.generate(
    **inputs,
    max_new_tokens=64,
    temperature=0.75,
    top_p=0.95,
    epsilon_cutoff=1e-5,
    repetition_penalty=1.05,
    renormalize_logits=True,
    do_sample=True,
) # adapt inference params as needed

print(tokenizer.decode(tokens[0], skip_special_tokens=True))

inference is decently fast on a colab T4:

CPU times: user 6.01 s, sys: 138 ms, total: 6.15 s
Wall time: 6.23 s
Safetensors: model size 8.82B params, tensor type BF16