---
license: apache-2.0
---

# Converted with ggerganov/ggml's stablelm conversion script and tested with KoboldCpp.

## *(I can't promise that this will work with other frontends, if at all; I haven't had much success myself. Use at your own risk!)*

**2023-04-20:** *q4_3. Used [commit 05f3079](https://github.com/ggerganov/ggml/tree/05f307971862b83df12fada0c42ee027ba5a82b5/examples/stablelm).*

**2023-04-30:** *q5_0, q5_1, and q8_0, up to 2.8B. I can't upload all conversions of 6.9B and 12B due to my internet connection. Used [commit 5dd92f4](https://github.com/ggerganov/ggml/tree/5dd92f421ee44f18b8fde0afbf5ca8fc7bf93841/examples/stablelm).*

**2023-05-06:** *q4_0 and q4_2, up to 2.8B. Used [commit ff6e03c](https://github.com/ggerganov/ggml/tree/ff6e03cbcd9bf6e9fa41d49f2495c042efae4dc6/examples/stablelm).*

**2023-05-15:** *New quantization format. q4_0 and q5_1, up to 2.8B. Used [commit 010203f](https://github.com/ggerganov/ggml/tree/010203f94a85df5c86b773dc5acb698c8e7b1e7b/examples/gpt-neox).*

The uploads are separated by date and commit so it's easier to track any breaking changes. A rough sketch of the conversion steps is at the bottom of this card.

# RAM USAGE (on KoboldCpp w/ OpenBLAS)

Model | Initial RAM | After generation
:--:|:--:|:--:
Unloaded | 41.3 MiB | —
ggml-pythia-70m-deduped-q4_0.bin | 113.3 MiB | 267.8 MiB
ggml-pythia-70m-deduped-q5_1.bin | 121.5 MiB | 129.4 MiB
ggml-pythia-160m-deduped-q4_0.bin | 199.4 MiB | 201.6 MiB
ggml-pythia-160m-deduped-q5_1.bin | 227.5 MiB | 241.0 MiB
ggml-pythia-410m-deduped-q4_0.bin | 399.2 MiB | 406.2 MiB
ggml-pythia-410m-deduped-q5_1.bin | 455.7 MiB | 460.3 MiB
ggml-pythia-1b-deduped-q4_0.bin | 803.0 MiB | 809.0 MiB
ggml-pythia-1b-deduped-q5_1.bin | 921.5 MiB | 927.3 MiB
ggml-pythia-1.4b-deduped-q4_0.bin | 1.1 GiB | 1.1 GiB
ggml-pythia-1.4b-deduped-q5_1.bin | 1.3 GiB | 1.3 GiB
ggml-pythia-2.8b-deduped-q4_0.bin | 2.0 GiB | 2.0 GiB
ggml-pythia-2.8b-deduped-q5_1.bin | 2.4 GiB | 2.4 GiB

Rough math on why the q5_1 files sit higher than q4_0 is also at the bottom of this card.

# ALTERNATIVES

If you're here because you want a smaller model to run on a device with constrained memory, consider the following:

- OpenLLaMA [(3B)](https://huggingface.co/openlm-research/open_llama_3b_350bt_preview) [(7B)](https://huggingface.co/openlm-research/open_llama_7b_400bt_preview)
- RedPajama-INCITE [(3B)](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1) [(7B)](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-7B-v0.1)
- MPT [(1B)](https://huggingface.co/mosaicml/mpt-1b-redpajama-200b) [(7B)](https://huggingface.co/mosaicml/mpt-7b)
- RWKV PilePlus [(169M) (430M) (1.5B) (3B)](https://huggingface.co/BlinkDL/rwkv-4-pileplus)

All of them are trained at least partially on [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T), an open reproduction of LLaMA's dataset, but they're based on different architectures: OpenLLaMA uses the LLaMA architecture (making it compatible with llama.cpp), RedPajama-INCITE is based on GPT-NeoX, and MPT and RWKV use architectures of their own.
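# CONVERSION WORKFLOW (rough sketch)

For reference, the steps behind these files look roughly like the sketch below. Treat it as an assumption-laden outline rather than a verified recipe: the conversion script and quantize tool live under `examples/stablelm` (later renamed `examples/gpt-neox`) in the pinned ggml commits, and their exact names and arguments changed between those commits, so check the README at whichever commit you build. The directory paths and the quantize type argument here are placeholders.

```python
# Rough outline of the GGML conversion + quantization steps (hypothetical paths;
# the real script/binary names and arguments depend on the pinned ggml commit).
import subprocess

GGML_DIR = "ggml"                  # ggerganov/ggml checkout, built at the pinned commit (assumption)
MODEL_DIR = "pythia-160m-deduped"  # local EleutherAI/pythia-* checkout (assumption)

# 1) Convert the Hugging Face checkpoint to an f16 GGML file
#    (the gpt-neox example's convert-h5-to-ggml.py; "1" selects f16 output).
subprocess.run(
    ["python3", f"{GGML_DIR}/examples/gpt-neox/convert-h5-to-ggml.py", MODEL_DIR, "1"],
    check=True,
)

# 2) Quantize the f16 file to the target format with the example's quantize tool.
#    Older commits take a numeric ftype code as the last argument instead of a name.
subprocess.run(
    [
        f"{GGML_DIR}/build/bin/gpt-neox-quantize",
        f"{MODEL_DIR}/ggml-model-f16.bin",
        f"{MODEL_DIR}/ggml-model-q5_1.bin",
        "q5_1",
    ],
    check=True,
)
```

The resulting `.bin` is what gets loaded by KoboldCpp; the four dated batches above only differ in which ggml commit performed these two steps.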
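# QUANTIZATION FORMATS (rough size math)

The gap between the q4_0 and q5_1 rows in the RAM table is mostly bits per weight. In the classic GGML formats (as of the 2023-05-15 format change; the earlier uploads use slightly different block layouts), weights are quantized in blocks of 32 values, and each block also stores an fp16 scale, plus an fp16 minimum for the `_1` variants. The Python sketch below is a back-of-the-envelope estimate of the quantized weights alone, not how ggml itself computes anything: it ignores tensors kept at higher precision, file headers, the KV cache, and KoboldCpp's own buffers.

```python
# Back-of-the-envelope size estimate for classic GGML quantization formats.
# Each block covers 32 weights; the byte counts are the standard block layouts
# (packed quantized values + fp16 scale, plus an fp16 minimum for q5_1).

BLOCK_WEIGHTS = 32
BYTES_PER_BLOCK = {
    "q4_0": 2 + 16,          # fp16 scale + 32 x 4-bit values              -> 4.5 bits/weight
    "q5_0": 2 + 4 + 16,      # fp16 scale + 32 high bits + 32 low nibbles  -> 5.5 bits/weight
    "q5_1": 2 + 2 + 4 + 16,  # fp16 scale + fp16 min + high bits + nibbles -> 6.0 bits/weight
    "q8_0": 2 + 32,          # fp16 scale + 32 x 8-bit values              -> 8.5 bits/weight
}

def estimated_weight_size_gib(n_params: float, fmt: str) -> float:
    """Rough size of the quantized weights alone, in GiB (no headers, no KV cache,
    and pretending every tensor is quantized, which is not actually the case)."""
    blocks = n_params / BLOCK_WEIGHTS
    return blocks * BYTES_PER_BLOCK[fmt] / 1024**3

if __name__ == "__main__":
    for fmt in ("q4_0", "q5_1"):
        print(f"pythia-2.8b {fmt}: ~{estimated_weight_size_gib(2.8e9, fmt):.1f} GiB")
```

At 2.8B parameters this prints roughly 1.5 GiB for q4_0 and 2.0 GiB for q5_1, which is in the same ballpark as the 2.0 GiB / 2.4 GiB rows in the table once the unquantized tensors and runtime buffers are added on top.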