---
license: apache-2.0
tags:
- Text
- Text Generation
- Transformers
- English
- mixtral
- Merge
- Quantization
- MoE
- tinyllama
---

This is a q5_K_M GGUF quantization of https://huggingface.co/s3nh/TinyLLama-4x1.1B-MoE.

I'm not sure how well it performs, and this is also my first quantization, so fingers crossed.

It is a Mixture of Experts model with https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0 as its base model.

The other 3 models in the merge are:

https://huggingface.co/78health/TinyLlama_1.1B-function-calling

https://huggingface.co/phanerozoic/Tiny-Pirate-1.1b-v0.1

https://huggingface.co/Tensoic/TinyLlama-1.1B-3T-openhermes

I make no claims to any of the development; I simply wanted to try the model out, so I quantized it and thought I'd share it in case anyone else was feeling experimental.
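
To try the GGUF file directly, here is a minimal sketch using llama-cpp-python. The model filename, context size, and sampling settings are my own assumptions; substitute the actual GGUF filename from this repo:

# Minimal llama-cpp-python sketch (pip install llama-cpp-python).
# The GGUF filename below is an assumption; use the file shipped in this repo.
from llama_cpp import Llama

llm = Llama(model_path="tinyllama-4x1.1b-moe.q5_K_M.gguf", n_ctx=2048)

# Zephyr-style prompt format used by the TinyLlama chat base model.
prompt = """<|system|>
You are a helpful AI assistant.</s>
<|user|>
Tell me a story about a wrecked ship.</s>
<|assistant|>
"""

output = llm(prompt, max_tokens=256, temperature=0.7, stop=["</s>"])
print(output["choices"][0]["text"])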

-------

Default settings, taken from the tinyllama Modelfile on Ollama:

TEMPLATE """<|system|>
{{ .System }}</s>
<|user|>
{{ .Prompt }}</s>
<|assistant|>
"""
# Tweak the SYSTEM message to adjust personality etc.
SYSTEM """You are a helpful AI assistant."""

PARAMETER stop "<|system|>"
PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "</s>"
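
For reference, the same Zephyr-style prompt format can be reproduced in Python from the base model's chat template. A minimal sketch, assuming the transformers library and access to the TinyLlama chat tokenizer:

from transformers import AutoTokenizer

# The base chat model's tokenizer ships the <|system|>/<|user|>/<|assistant|> template.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Tell me a story about a wrecked ship."},
]

# add_generation_prompt=True appends the trailing <|assistant|> tag
# so the model knows to start its reply.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)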

-------

Model card from https://huggingface.co/s3nh/TinyLLama-4x1.1B-MoE

Example usage:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("s3nh/TinyLLama-4x1.1B-MoE")
model = AutoModelForCausalLM.from_pretrained("s3nh/TinyLLama-4x1.1B-MoE").to(device)

input_text = """
###Input: You are a pirate. Tell me a story about a wrecked ship.
###Response:
"""

input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)
output = model.generate(inputs=input_ids,
                        max_length=256,  # was undefined in the original; 256 is an arbitrary choice
                        do_sample=True,
                        top_k=10,
                        temperature=0.7,
                        pad_token_id=tokenizer.eos_token_id,
                        attention_mask=input_ids.new_ones(input_ids.shape))
print(tokenizer.decode(output[0], skip_special_tokens=True))

This model was made possible by the tremendous work of the mergekit developers. I decided to merge TinyLlama models to create a mixture of experts. The config used is below:

base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
experts:
  - source_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
    positive_prompts:
    - "chat"
    - "assistant"
    - "tell me"
    - "explain"
  - source_model: 78health/TinyLlama_1.1B-function-calling
    positive_prompts:
    - "code"
    - "python"
    - "javascript"
    - "programming"
    - "algorithm"
  - source_model: phanerozoic/Tiny-Pirate-1.1b-v0.1
    positive_prompts:
    - "storywriting"
    - "write"
    - "scene"
    - "story"
    - "character"
  - source_model: Tensoic/TinyLlama-1.1B-3T-openhermes
    positive_prompts:
    - "reason"
    - "provide"
    - "instruct"
    - "summarize"
    - "count"