pszemraj committed on
Commit 9fec9a3 · 1 Parent(s): 8da60ba

Update README.md

Files changed (1)
  1. README.md +47 -2
README.md CHANGED
@@ -12,6 +12,51 @@ tags:
  > persimmon-8b went to the vocab lipo clinic


- This is a slimmed-down version of [persimmon-8b-base](https://huggingface.co/adept/persimmon-8b-base) that removes the 70,000 unused entries in the model vocab and tokenizer (check out the safetensors layer overview). Should be _slightly_ faster.

- Credit: [fine-tune-fuyu](https://github.com/phillip-kravtsov/fine-tune-fuyu) (`scripts/surgery.py` was adapted for persimmon)
+ A slimmed-down version of [persimmon-8b-base](https://huggingface.co/adept/persimmon-8b-base) which removes the ~70,000 unused entries in the model vocabulary and tokenizer (see the safetensors layer overview). Should be _slightly_ faster.

+ Credit: [fine-tune-fuyu](https://github.com/phillip-kravtsov/fine-tune-fuyu) (`scripts/surgery.py` was adapted for persimmon)
+
+
+ ## inference
+
+ install required pkgs:
+
+ ```sh
+ pip install -U transformers accelerate bitsandbytes sentencepiece
+ ```
+
+ load in 4bit & run inference:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("pszemraj/perSLIMmon-8b-base")
+ model = AutoModelForCausalLM.from_pretrained(
+     "pszemraj/perSLIMmon-8b-base",
+     load_in_4bit=True,  # GPU required
+     torch_dtype="auto",
+     device_map="auto",
+ )
+ inputs = tokenizer("The weather is always wonderful", return_tensors="pt").to(
+     model.device
+ )
+ tokens = model.generate(
+     **inputs,
+     max_new_tokens=64,
+     temperature=0.75,
+     top_p=0.95,
+     epsilon_cutoff=1e-5,
+     repetition_penalty=1.05,
+     renormalize_logits=True,
+     do_sample=True,
+ )  # adapt inference params as needed
+
+ print(tokenizer.decode(tokens[0], skip_special_tokens=True))
+ ```
+
+ inference is decently fast on a colab T4:
+
+ ```
+ CPU times: user 6.01 s, sys: 138 ms, total: 6.15 s
+ Wall time: 6.23 s
+ ```
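
The README change describes removing unused vocabulary entries from the model and tokenizer. As a rough illustration of that idea only, here is a toy sketch in plain Python. It is not the actual `scripts/surgery.py`: the function `trim_vocab`, the sample tokens, and the list-of-lists "embedding table" are all hypothetical stand-ins for the real tokenizer and weight tensors.

```python
# Toy sketch of vocabulary trimming; NOT the actual scripts/surgery.py.
# Plain lists stand in for the embedding tensor; all names are hypothetical.

def trim_vocab(vocab, embeddings, used_tokens):
    """Drop unused tokens and re-index the survivors contiguously.

    vocab:       dict mapping token string -> old row index
    embeddings:  list of rows, one per old index
    used_tokens: set of token strings to keep
    """
    # Sort survivors by old index so their relative order is preserved.
    kept = sorted((idx, tok) for tok, idx in vocab.items() if tok in used_tokens)
    # Assign new contiguous ids and gather the matching embedding rows.
    new_vocab = {tok: new_idx for new_idx, (_, tok) in enumerate(kept)}
    new_embeddings = [embeddings[old_idx] for old_idx, _ in kept]
    return new_vocab, new_embeddings


vocab = {"<unk>": 0, "hello": 1, "<unused_0>": 2, "world": 3, "<unused_1>": 4}
embeddings = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
used = {"<unk>", "hello", "world"}

new_vocab, new_embeddings = trim_vocab(vocab, embeddings, used)
print(new_vocab)
print(len(new_embeddings), "rows kept out of", len(embeddings))
```

In a real model the same row selection would presumably also have to be applied to the output projection, and the tokenizer files rewritten to match the new ids; the sketch leaves both out.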