komt-Llama-2-13b-hf-ggml
https://github.com/davidkim205/komt
This repository provides the Korean Llama 2 13B model (komt) quantized with llama.cpp into 4-bit and other GGML formats.
Because the files use the same GGML format as TheBloke's releases, they work with the libraries and UIs listed below.
The following description is adapted from TheBloke/Llama-2-13B-chat-GGML.
GGML files are for CPU + GPU inference using llama.cpp and libraries and UIs which support this format, such as:
- KoboldCpp, a powerful GGML web UI with full GPU acceleration out of the box. Especially good for storytelling.
- LoLLMS Web UI, a great web UI with GPU acceleration via the c_transformers backend.
- LM Studio, a fully featured local GUI. Supports full GPU accel on macOS. Also supports Windows, without GPU accel.
- text-generation-webui, the most popular web UI. Requires extra steps to enable GPU accel via llama.cpp backend.
- ctransformers, a Python library with LangChain support and OpenAI-compatible AI server.
- llama-cpp-python, a Python library with OpenAI-compatible API server (see the loading sketch below).
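As a quick way to try the files from Python, llama-cpp-python (listed above) can load them directly. This is a minimal sketch, assuming a pre-GGUF release of llama-cpp-python that still reads GGML .bin files and the q8_0 file path used later in this card; adjust both to your setup.

```python
# Minimal sketch: load the GGML file with llama-cpp-python and run one prompt.
# Assumes an older llama-cpp-python release that still accepts GGML .bin models.
from llama_cpp import Llama

llm = Llama(
    model_path="models/komt-Llama-2-13b-hf-ggml/ggml-model-q8_0.bin",
    n_ctx=512,      # matches the context size reported in the log below
    n_threads=8,
)

prompt = "### instruction: 영화 해리포터 시리즈 배급사가 어디야\n\n### Response:"
out = llm(prompt, max_tokens=256, temperature=0.8, stop=["###"])
print(out["choices"][0]["text"].strip())
```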
Model Details
- Model Developers: davidkim (changyeon kim)
- Repository : https://github.com/davidkim205/komt
- Quant methods: q4_0, q4_1, q5_0, q5_1, q2_k, q3_k, q3_k_m, q3_k_l, q4_k, q4_k_s, q4_k_m, q5_k, q5_k_s, q5_k_m, q8_0 (see the path helper sketched below)
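Only the q8_0 file name appears elsewhere in this card (ggml-model-q8_0.bin in the Usage section), so the helper below assumes the same ggml-model-<method>.bin naming for the other variants; it is a sketch under that assumption, not a confirmed file listing.

```python
from pathlib import Path

# Quantization variants listed above. The file name pattern is inferred from the
# Usage section (ggml-model-q8_0.bin); names for the other methods are assumed.
QUANT_METHODS = [
    "q4_0", "q4_1", "q5_0", "q5_1", "q2_k", "q3_k", "q3_k_m", "q3_k_l",
    "q4_k", "q4_k_s", "q4_k_m", "q5_k", "q5_k_s", "q5_k_m", "q8_0",
]

def quant_file(model_dir: str, method: str) -> Path:
    """Return the expected path of the GGML file for one of the listed quant methods."""
    if method not in QUANT_METHODS:
        raise ValueError(f"unknown quant method: {method}")
    return Path(model_dir) / f"ggml-model-{method}.bin"

print(quant_file("models/komt-Llama-2-13b-hf-ggml", "q8_0"))
```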
Prompt Template
### instruction: {prompt}
### Response:
Examples:
### instruction: 자동차 종합(정기)검사 의무기간은 얼마인가요?
### Response:
response:
### instruction: 자동차 종합(정기)검사 의무기간은 얼마인가요?
### Response:자동차 종합(정기)검사는 2년 1991년 7월 1일에 고시된 '자동차 보험료 조정기준'에서 취리로부터 제정된 기준 전 경량 이상차를 제외한 자동차 모든 승용자동차는 2년마다 필요하다. 이 법은 차량에 관계없이 2년마다 정기검사를 해야한다고 규정했다.
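All of the examples use the same two-line template, so prompts can be built with a small helper. A minimal sketch (the function name is arbitrary; the \n\n spacing matches the ./main command in the Usage section below):

```python
def build_prompt(instruction: str) -> str:
    """Wrap a question in the komt prompt template shown above."""
    return f"### instruction: {instruction}\n\n### Response:"

# Example: the vehicle-inspection question from the examples above.
print(build_prompt("자동차 종합(정기)검사 의무기간은 얼마인가요?"))
```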
Usage
When using the original llama.cpp:
make -j && ./main -m models/komt-Llama-2-13b-hf-ggml/ggml-model-q8_0.bin -p "### instruction: 영화 해리포터 시리즈 배급사가 어디야\n\n### Response:"
When using the modified llama.cpp for Korean multi-task (recommended; see https://github.com/davidkim205/komt/tree/main/llama.cpp):
make -j && ./main -m models/komt-Llama-2-13b-hf-ggml/ggml-model-q8_0.bin -p "영화 해리포터 시리즈 배급사가 어디야"
response:
$ make -j && ./main -m models/komt-Llama-2-13b-hf-ggml/ggml-model-q8_0.bin -p "영화 해리포터 시리즈 배급사가 어디야"
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I LDFLAGS:
I CC: cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX: g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
make: Nothing to be done for 'default'.
main: build = 6 (01a61bf)
main: seed = 1692190774
llama.cpp: loading model from models/komt-Llama-2-13b-hf-ggml/ggml-model-q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 6912
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: mem required = 13152.13 MB (+ 400.00 MB per state)
llama_new_context_with_model: kv self size = 400.00 MB
llama_new_context_with_model: compute buffer total size = 75.35 MB
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
### instruction: 영화 해리포터 시리즈 배급사가 어디야
### Response:워너 브라더스
해리포터(Harry Potter)는 J. K. 롤링이 쓴 판타지 소설이다. 1997년부터 2007년까지 총 7권으로 발표되었고, 전 세계적으로 많은 인기를 끌었다. 영국에서는 블룸버그(Bloomsbury), 미국에서는 워너 브라더스(Warner Brothers)가 각각 출판하였다. 현재 전 세계적으로 2억 4,000만 부 이상의 판매고를 올리고 있으며, 전 세계 대부분의 문학가들에게 영향을 주었다. ### check_end_of_text [end of text]
llama_print_timings: load time = 801.73 ms
llama_print_timings: sample time = 108.54 ms / 308 runs ( 0.35 ms per token, 2837.66 tokens per second)
llama_print_timings: prompt eval time = 2651.47 ms / 43 tokens ( 61.66 ms per token, 16.22 tokens per second)
llama_print_timings: eval time = 120629.25 ms / 307 runs ( 392.93 ms per token, 2.54 tokens per second)
llama_print_timings: total time = 123440.86 ms
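To script the same ./main calls shown in Usage, a thin subprocess wrapper is enough. This is a sketch assuming llama.cpp has already been built in the current directory and the q8_0 file sits at the path above.

```python
import subprocess

MODEL = "models/komt-Llama-2-13b-hf-ggml/ggml-model-q8_0.bin"

def run_main(prompt: str, binary: str = "./main") -> str:
    """Invoke the llama.cpp main binary with -m/-p, as in the Usage section."""
    result = subprocess.run(
        [binary, "-m", MODEL, "-p", prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Original llama.cpp: wrap the question in the prompt template.
print(run_main("### instruction: 영화 해리포터 시리즈 배급사가 어디야\n\n### Response:"))

# Modified llama.cpp for Korean multi-task: pass the raw question.
# print(run_main("영화 해리포터 시리즈 배급사가 어디야"))
```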