
	---
	license: apache-2.0
	language:
	- en
	pipeline_tag: text-generation
	tags:
	- persimmon
	---

	# perSLIMmon-8b-base

	> persimmon-8b went to the vocab lipo clinic


	A slimmed-down version of [persimmon-8b-base](https://huggingface.co/adept/persimmon-8b-base) that removes the ~70,000 unused entries from the model vocabulary and tokenizer (see the safetensors layer overview). Inference should be _slightly_ faster as a result.

	Credit: [fine-tune-fuyu](https://github.com/phillip-kravtsov/fine-tune-fuyu) (`scripts/surgery.py` was adapted for persimmon)
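
To make the idea of vocabulary trimming concrete, here is a minimal pure-Python sketch. It is not the adapted `scripts/surgery.py`; the function `trim_vocab`, the toy embedding table, and the placeholder token `<unused_0>` are all hypothetical and only illustrate the general technique: keep the rows for token ids that are actually used, then renumber the remaining ids densely. In a real model the same row selection would presumably apply to the input embedding matrix and to any output projection that has one row per token.

```python
from typing import Dict, List, Sequence, Tuple


def trim_vocab(
    embeddings: Sequence[Sequence[float]],
    vocab: Dict[str, int],
    used_ids: Sequence[int],
) -> Tuple[List[List[float]], Dict[str, int], Dict[int, int]]:
    """Keep only the rows/tokens whose ids are in used_ids and renumber them.

    Hypothetical helper for illustration; not taken from the original scripts.
    """
    keep = sorted(set(used_ids))
    # Map each surviving old id to its new, dense id (0..len(keep)-1).
    old_to_new = {old: new for new, old in enumerate(keep)}
    # Copy the embedding rows of the surviving ids, in the new order.
    new_embeddings = [list(embeddings[old]) for old in keep]
    # Drop removed tokens from the vocabulary and remap the rest.
    new_vocab = {tok: old_to_new[i] for tok, i in vocab.items() if i in old_to_new}
    return new_embeddings, new_vocab, old_to_new


# Toy data: 4 tokens with 2-dimensional embeddings; id 2 is an unused placeholder.
embeddings = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
vocab = {"<unk>": 0, "hello": 1, "<unused_0>": 2, "world": 3}

new_emb, new_vocab, mapping = trim_vocab(embeddings, vocab, used_ids=[0, 1, 3])
print(len(new_emb), new_vocab)
```

The point of the sketch is that every surviving token keeps exactly the same embedding row; only its integer id changes, which is why the tokenizer has to be updated together with the weights.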


	## inference

	install required pkgs:

	```sh
	pip install -U transformers accelerate bitsandbytes sentencepiece
	```

	load in 4bit & run inference:

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

	tokenizer = AutoTokenizer.from_pretrained("pszemraj/perSLIMmon-8b-base")
	model = AutoModelForCausalLM.from_pretrained(
	    "pszemraj/perSLIMmon-8b-base",
	    # 4-bit quantization via bitsandbytes (GPU required)
	    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
	    torch_dtype="auto",
	    device_map="auto",
	)
	# tokenize the prompt and move the tensors to the model's device
	inputs = tokenizer("The weather is always wonderful", return_tensors="pt").to(
	    model.device
	)
	tokens = model.generate(
	    **inputs,
	    max_new_tokens=64,
	    temperature=0.75,
	    top_p=0.95,
	    epsilon_cutoff=1e-5,
	    repetition_penalty=1.05,
	    renormalize_logits=True,
	    do_sample=True,
	)  # adapt inference params as needed

	print(tokenizer.decode(tokens[0], skip_special_tokens=True))
	```

	inference is decently fast on a Colab T4 (IPython timing output):

	```
	CPU times: user 6.01 s, sys: 138 ms, total: 6.15 s
	Wall time: 6.23 s
	```
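
To get a comparable number outside a notebook, a small timing wrapper is enough. The sketch below is an assumption-laden illustration, not how the figures above were produced: `time_generation` and `fake_generate` are hypothetical names, and the dummy function stands in for a real `model.generate` call so that the snippet runs without a GPU.

```python
import time
from typing import Callable, Tuple


def time_generation(generate_fn: Callable[[], int]) -> Tuple[float, float]:
    """Run generate_fn once and return (wall_seconds, new_tokens_per_second).

    generate_fn is a hypothetical zero-argument callable that performs one
    generation and returns the number of newly generated tokens.
    """
    start = time.perf_counter()
    n_new_tokens = generate_fn()
    wall = time.perf_counter() - start
    tokens_per_second = n_new_tokens / wall if wall > 0 else float("inf")
    return wall, tokens_per_second


def fake_generate() -> int:
    """Stand-in for a real generation call; pretends to emit 64 tokens."""
    time.sleep(0.01)
    return 64


wall, tps = time_generation(fake_generate)
print(f"Wall time: {wall:.3f} s, {tps:.1f} new tokens/s")

# With the model and inputs from the inference snippet (not run here):
# def run() -> int:
#     out = model.generate(**inputs, max_new_tokens=64, do_sample=True)
#     return out.shape[-1] - inputs["input_ids"].shape[-1]
# wall, tps = time_generation(run)
```

Counting only the new tokens (output length minus prompt length) keeps the throughput figure independent of how long the prompt is.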