---
language:
- en
license: llama2
tags:
- meta
- llama-2
- wasmedge
- second-state
- llama.cpp
model_name: Llama 2 GGUF
inference: false
model_creator: Meta Llama 2
model_type: llama
pipeline_tag: text-generation
prompt_template: '[INST] <<SYS>>

  You are a helpful, respectful and honest assistant. Always answer as helpfully as
  possible, while being safe.  Your answers should not include any harmful, unethical,
  racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses
  are socially unbiased and positive in nature. If a question does not make any sense,
  or is not factually coherent, explain why instead of answering something not correct.
  If you don''t know the answer to a question, please don''t share false information.

  <</SYS>>

  {prompt}[/INST]

  '
quantized_by: wasmedge
---

This repo contains GGUF model files for cross-platform AI inference using the [WasmEdge Runtime](https://github.com/WasmEdge/WasmEdge).
[Learn more](https://medium.com/stackademic/fast-and-portable-llama2-inference-on-the-heterogeneous-edge-a62508e82359) about why and how.
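
The chat models expect the Llama 2 `[INST]` prompt format declared in the metadata above. The `llama-chat.wasm` app used below is designed to apply this template for you, so the snippet that follows is only an illustration of what the model actually sees (the user question is made up):

```
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. ... (system prompt as above)
<</SYS>>

What is WasmEdge? [/INST]
```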


## Prerequisites


Install WasmEdge with the GGML plugin. 

```
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugin wasi_nn-ggml
```
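
If the `wasmedge` command is not found afterwards, you may need to load the environment changes the installer makes (the path below is the installer's usual default; adjust it if you chose a custom location). You can then verify the install:

```
# Load the installer's environment file into the current shell
# (default location; an assumption if you customized the install).
source $HOME/.wasmedge/env

# Print the WasmEdge version to confirm the runtime is available.
wasmedge --version
```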

Download the cross-platform Wasm apps for inference. 

```
curl -LO https://github.com/second-state/llama-utils/raw/main/simple/llama-simple.wasm

curl -LO https://github.com/second-state/llama-utils/raw/main/chat/llama-chat.wasm
```
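
The inference commands below assume both apps sit in the current directory, next to the `.gguf` model files. A quick sanity check:

```
# Confirm both Wasm apps were downloaded into the working directory.
ls -lh llama-simple.wasm llama-chat.wasm
```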

## Use the quantized models


The `q5_k_m` files are quantized versions of the llama2 models. They are a fraction of the size of the original models, and hence consume far less VRAM, while still giving high-quality inference results.
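
All of the commands below follow the same pattern: `--dir .:.` gives the Wasm app access to the current directory, where the model file must be located, and `--nn-preload` registers that model with the WASI-NN GGML plugin. As a sketch, assuming the usual `alias:backend:device:file` preload format (`<model-file>` is a placeholder for any `.gguf` file from this repo):

```
# Map the current directory into the Wasm sandbox and preload the model
# under the alias "default", letting the GGML backend pick the device.
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:<model-file>.gguf \
  llama-chat.wasm
```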

Chat with the 7b chat model

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm
```

Generate text with the 7b base model

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-q5_k_m.gguf llama-simple.wasm 'Robert Oppenheimer most important achievement is '
```

Chat with the 13b chat model

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-13b-chat-q5_k_m.gguf llama-chat.wasm
```

Generate text with the 13b base model

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-13b-q5_k_m.gguf llama-simple.wasm 'Robert Oppenheimer most important achievement is '
```

## Use the f16 models


The f16 version is the GGUF equivalent of the original llama2 models. It gives the best-quality inference results but also consumes the most resources, in both VRAM and compute time. The f16 models are also a good basis for fine-tuning.

Chat with the 7b chat model

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-f16.gguf llama-chat.wasm
```

Generate text with the 7b base model

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-f16.gguf llama-simple.wasm 'Robert Oppenheimer most important achievement is '
```

Chat with the 13b chat model

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-13b-chat-f16.gguf llama-chat.wasm
```

Generate text with the 13b base model

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-13b-f16.gguf llama-simple.wasm 'Robert Oppenheimer most important achievement is '
```

## Resource-constrained models


The `q2_k` files are the smallest quantized versions of the llama2 models. They can run on devices with only 4 GB of RAM, but the inference quality is noticeably lower.

Chat with the 7b chat model

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q2_k.gguf llama-chat.wasm
```

Generate text with the 7b base model

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-q2_k.gguf llama-simple.wasm 'Robert Oppenheimer most important achievement is '
```

Chat with the 13b chat model

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-13b-chat-q2_k.gguf llama-chat.wasm
```

Generate text with the 13b base model

```
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-13b-q2_k.gguf llama-simple.wasm 'Robert Oppenheimer most important achievement is '
```