---
license: llama2
---

# Sample repository

Development Status :: 2 - Pre-Alpha
Developed by MinWoo Park, 2023, Seoul, South Korea. [Contact: parkminwoo1991@gmail.com](mailto:parkminwoo1991@gmail.com). [![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fhuggingface.co%2Fdanielpark%2Fko-llama-2-jindo-7b-instruct-ggml&count_bg=%23000000&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=views&edge_flat=false)](https://hits.seeyoufarm.com)

# What is [GGML](https://github.com/ggerganov/ggml)?

GGML is a tensor library for machine learning that enables large models and high performance on commodity hardware. GGML is still under active development (a more efficient successor format and new k-quant methods are in progress), so the format is not yet stable. Read more in the [GGUF documentation](https://github.com/philpax/ggml/blob/gguf-spec/docs/gguf.md#why-not-other-formats).

## Model Weights Offered

| Model | Size (GB) | Description | Performance |
| --- | --- | --- | --- |
| [jindo-7b-instruct](https://huggingface.co/danielpark/ko-llama-2-jindo-7b-instruct) | 12.6 | Original model weights | |
| jindo-7b-instruct.ggmlv3.f16.bin | 12.5 | Model weights converted to GGML f16 format | |
| jindo-7b-instruct.ggmlv3.q4_0.bin | 3.73 | Legacy 4-bit quantization in blocks of 32 weights with a single scale per block | Legacy |
| jindo-7b-instruct.ggmlv3.q4_k_m.bin | 3.98 | 4-bit k-quantization in super-blocks containing 8 blocks, each block having 32 weights; scales and mins are quantized with 6 bits | Medium, balanced quality |
| jindo-7b-instruct.ggmlv3.q5_k_m.bin | 4.67 | 5-bit k-quantization with the same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw | Large, very low quality loss |

## Prompt template: None

```
{prompt}
```
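To fetch one of the GGML files listed above without cloning the whole repository, `huggingface_hub` can download a single file. The snippet below is a minimal sketch; it assumes you want the q4_k_m file named in the table.

```python
# Sketch: download a single GGML weight file from this repository.
# Requires `pip install huggingface_hub`; the filename is taken from the table above.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="danielpark/ko-llama-2-jindo-7b-instruct-ggml",
    filename="jindo-7b-instruct.ggmlv3.q4_k_m.bin",  # pick any file listed above
)
print(model_path)  # local cache path to pass to llama.cpp / LangChain loaders
```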

# Inference

To run inference with the danielpark/ko-llama-2-jindo-7b-instruct-ggml weights (fine-tuned from Llama 2) on CPU or GPU, you need the appropriate installation and configuration on your system. Please refer to the [llama.cpp repository](https://github.com/ggerganov/llama.cpp) and [LangChain's documentation](https://python.langchain.com/docs/integrations/llms/llamacpp), and follow the guides for any additional dependencies as needed.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https%3A//huggingface.co/danielpark/ko-llama-2-jindo-7b-instruct-ggml/blob/main/ggml_format_inference.ipynb)

### Using [LLaMA CPP module in LangChain](https://python.langchain.com/docs/integrations/llms/llamacpp)

```
$ pip install langchain ctransformers llama-cpp-python
```

```python
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Stream generated tokens to stdout; this is the callback_manager used below.
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
```

#### CPU

```python
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="./models/jindo-7b-instruct-ggml-model-f16.bin",
    input={"temperature": 0.75, "max_length": 2000, "top_p": 1},
    callback_manager=callback_manager,
    verbose=True,
)
```

#### GPU

If the installation with a BLAS backend was successful, you will see a `BLAS = 1` indicator in the model properties. The two most important parameters for GPU use are:

- `n_gpu_layers` - how many layers of the model are offloaded to your GPU.
- `n_batch` - how many tokens are processed in parallel.

```python
n_gpu_layers = 40  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="./models/jindo-7b-instruct-ggml-model-f16.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,
)
```

#### Metal

```python
n_gpu_layers = 1  # Metal set to 1 is enough.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon chip.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="./models/jindo-7b-instruct-ggml-model-f16.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    f16_kv=True,  # MUST be set to True, otherwise you will run into problems after a couple of calls
    callback_manager=callback_manager,
    verbose=True,
)
```

### Using [C Transformers module in LangChain](https://python.langchain.com/docs/integrations/llms/ctransformers.html)

```python
from langchain.llms import CTransformers

llm = CTransformers(model="./models/jindo-7b-instruct-ggml-model-f16.bin", model_type="llama")

print(llm("LLM Jindo is going to"))
```
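The imports above also pull in `PromptTemplate` and `LLMChain`, but the snippets never wire them to the loaded model. A minimal sketch of that last step, reusing any of the `llm` objects created above (the question text is only illustrative):

```python
from langchain import PromptTemplate, LLMChain

# Reuse the `llm` object built in one of the sections above (CPU, GPU, Metal, or C Transformers).
template = """Question: {question}

Answer:"""
prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)
print(llm_chain.run("진돗개에 대해 설명해 주세요."))  # "Please describe the Jindo dog."
```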
## Web Demo

I implemented the web demo using several popular tools that make it easy to build web UIs quickly.

| Model | Web UI | Quantized |
| --- | --- | --- |
| danielpark/ko-llama-2-jindo-7b-instruct | using [gradio](https://github.com/dsdanielpark/gradio) on [colab](https://colab.research.google.com/drive/1zwR7rz6Ym53tofCGwZZU8y5K_t1r1qqo#scrollTo=p2xw_g80xMsD) | - |
| danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq | using [text-generation-webui](https://github.com/oobabooga/text-generation-webui) on [colab](https://colab.research.google.com/drive/19ihYHsyg_5QFZ_A28uZNR_Z68E_09L4G) | gptq |
| danielpark/ko-llama-2-jindo-7b-instruct-ggml | [koboldcpp-v1.38](https://github.com/LostRuins/koboldcpp/releases/tag/v1.38) | ggml |
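For reference, a web demo of this kind can be a very small script. The sketch below is hypothetical (it is not the linked notebook) and wraps the GGML weights with the `ctransformers` loader behind a Gradio text box.

```python
# Sketch of a minimal Gradio demo around the GGML weights (illustrative only).
# Requires `pip install gradio ctransformers`.
import gradio as gr
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "./models/jindo-7b-instruct-ggml-model-f16.bin", model_type="llama"
)

def generate(prompt: str) -> str:
    # The model card specifies no prompt template, so the raw prompt is passed through.
    return llm(prompt, max_new_tokens=256)

gr.Interface(fn=generate, inputs="text", outputs="text",
             title="ko-llama-2-jindo-7b-instruct").launch()
```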
## Tools

| Name | Description |
| --- | --- |
| [KoboldCpp](https://github.com/LostRuins/koboldcpp) | A powerful GGML web UI with full GPU acceleration out of the box. Especially good for story-telling. |
| [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui) | A great web UI with GPU acceleration via the c_transformers backend. |
| [LM Studio](https://lmstudio.ai/) | A fully featured local GUI. Supports full GPU acceleration on macOS. Also supports Windows, without GPU acceleration. |
| [text-generation-webui](https://github.com/oobabooga/text-generation-webui) | The most popular web UI. Requires extra steps to enable GPU acceleration via the llama.cpp backend. |
| [ctransformers](https://github.com/marella/ctransformers) | A Python library with LangChain support and an OpenAI-compatible AI server. |
| [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) | A Python library with an OpenAI-compatible API server. |
## CLI Inference Using Quantized Weights

To run inference with the desired settings, execute the following command:

```
./main -t <threads> -ngl <gpu_layers> -m ko-llama-2-jindo-7b-instruct-ggml.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
```

Please make the following changes:

- Replace `<threads>` with the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use `-t 8`.
- Replace `<gpu_layers>` with the number of layers to offload to the GPU. If you don't have GPU acceleration, remove the `-ngl <gpu_layers>` argument.
- If you want to have a chat-style conversation, replace the `-p "<prompt>"` argument with `-i -ins`.

See [llama.cpp](https://github.com/ggerganov/llama.cpp#build), [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), and [llama2.c](https://github.com/karpathy/llama2.c) for more details.
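For example, on a machine with 8 physical cores and enough VRAM to offload 32 layers (both values are illustrative), the command becomes:

```
./main -t 8 -ngl 32 -m ko-llama-2-jindo-7b-instruct-ggml.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
```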
### Quant Types

| Quantization Type | Description | Bits per Weight (bpw) |
| --- | --- | --- |
| GGML_TYPE_Q2_K | "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. | 2.5625 |
| GGML_TYPE_Q3_K | "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. | 3.4375 |
| GGML_TYPE_Q4_K | "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. | 4.5 |
| GGML_TYPE_Q5_K | "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw. | 5.5 |
| GGML_TYPE_Q6_K | "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. | 6.5625 |
| GGML_TYPE_Q8_K | "type-0" 8-bit quantization. Only used for quantizing intermediate results. Block size is 256. All 2-6 bit dot products are implemented for this quantization type. | Not specified |

| Model | Description | Recommendation |
| --- | --- | --- |
| Q4_0 | Small, very high quality loss | Legacy, prefer Q3_K_M |
| Q4_1 | Small, substantial quality loss | Legacy, prefer Q3_K_L |
| Q5_0 | Medium, balanced quality | Legacy, prefer Q4_K_M |
| Q5_1 | Medium, low quality loss | Legacy, prefer Q5_K_M |
| Q2_K | Smallest, extreme quality loss | Not recommended |
| Q3_K | Alias for Q3_K_M | |
| Q3_K_S | Very small, very high quality loss | |
| Q3_K_M | Very small, very high quality loss | |
| Q3_K_L | Small, substantial quality loss | |
| Q4_K | Alias for Q4_K_M | |
| Q4_K_S | Small, significant quality loss | |
| Q4_K_M | Medium, balanced quality | Recommended |
| Q5_K | Alias for Q5_K_M | |
| Q5_K_S | Large, low quality loss | Recommended |
| Q5_K_M | Large, very low quality loss | Recommended |
| Q6_K | Very large, extremely low quality loss | |
| Q8_0 | Very large, extremely low quality loss | Not recommended |
| F16 | Extremely large, virtually no quality loss | Not recommended |
| F32 | Absolutely huge, lossless | Not recommended |

### Performance

#### LLaMA 2 / 7B

| name | +ppl | +ppl 13b to 7b % | size | size 16bit % | +ppl per -1G |
| --- | --- | --- | --- | --- | --- |
| q2_k | 0.8698 | 133.344% | 2.67GB | 20.54% | 0.084201 |
| q3_ks | 0.5505 | 84.394% | 2.75GB | 21.15% | 0.053707 |
| q3_km | 0.2437 | 37.360% | 3.06GB | 23.54% | 0.024517 |
| q3_kl | 0.1803 | 27.641% | 3.35GB | 25.77% | 0.018684 |
| q4_0 | 0.2499 | 38.311% | 3.50GB | 26.92% | 0.026305 |
| q4_1 | 0.1846 | 28.300% | 3.90GB | 30.00% | 0.020286 |
| q4_ks | 0.1149 | 17.615% | 3.56GB | 27.38% | 0.012172 |
| q4_km | 0.0535 | 8.202% | 3.80GB | 29.23% | 0.005815 |
| q5_0 | 0.0796 | 12.203% | 4.30GB | 33.08% | 0.009149 |
| q5_1 | 0.0415 | 6.362% | 4.70GB | 36.15% | 0.005000 |
| q5_ks | 0.0353 | 5.412% | 4.33GB | 33.31% | 0.004072 |
| q5_km | 0.0142 | 2.177% | 4.45GB | 34.23% | 0.001661 |
| q6_k | 0.0044 | 0.675% | 5.15GB | 39.62% | 0.000561 |
| q8_0 | 0.0004 | 0.061% | 6.70GB | 51.54% | 0.000063 |

#### LLaMA 2 / 13B

| name | +ppl | +ppl 13b to 7b % | size | size 16bit % | +ppl per -1G |
| --- | --- | --- | --- | --- | --- |
| q2_k | 0.6002 | 92.013% | 5.13GB | 20.52% | 0.030206 |
| q3_ks | 0.3490 | 53.503% | 5.27GB | 21.08% | 0.017689 |
| q3_km | 0.1955 | 29.971% | 5.88GB | 23.52% | 0.010225 |
| q3_kl | 0.1520 | 23.302% | 6.45GB | 25.80% | 0.008194 |
| q4_0 | 0.1317 | 20.190% | 6.80GB | 27.20% | 0.007236 |
| q4_1 | 0.1065 | 16.327% | 7.60GB | 30.40% | 0.006121 |
| q4_ks | 0.0861 | 13.199% | 6.80GB | 27.20% | 0.004731 |
| q4_km | 0.0459 | 7.037% | 7.32GB | 29.28% | 0.002596 |
| q5_0 | 0.0313 | 4.798% | 8.30GB | 33.20% | 0.001874 |
| q5_1 | 0.0163 | 2.499% | 9.10GB | 36.40% | 0.001025 |
| q5_ks | 0.0242 | 3.710% | 8.36GB | 33.44% | 0.001454 |
| q5_km | 0.0095 | 1.456% | 8.60GB | 34.40% | 0.000579 |
| q6_k | 0.0025 | 0.383% | 9.95GB | 39.80% | 0.000166 |
| q8_0 | 0.0005 | 0.077% | 13.00GB | 52.00% | 0.000042 |
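The derived columns can be sanity-checked from `+ppl` and the quantized size alone. A quick sketch for one row (7B q2_k), assuming the f16 baseline implied by the `size 16bit %` column (roughly 13.0 GB for 7B) and an f16 perplexity gap between LLaMA 2 7B and 13B of about 0.6523 (these baselines are not stated explicitly in the tables):

```python
# Sketch: reproduce the derived columns of the 7B table for the q2_k row.
# Assumed baselines: f16 7B size ~= 13.0 GB; 7B-to-13B f16 perplexity gap ~= 0.6523.
F16_SIZE_GB = 13.0
PPL_GAP_13B_TO_7B = 0.6523

ppl_increase = 0.8698   # "+ppl" for q2_k
quant_size_gb = 2.67    # "size" for q2_k

print(f"+ppl 13b to 7b %: {ppl_increase / PPL_GAP_13B_TO_7B:.3%}")              # ~133.344%
print(f"size 16bit %:     {quant_size_gb / F16_SIZE_GB:.2%}")                   # ~20.54%
print(f"+ppl per -1G:     {ppl_increase / (F16_SIZE_GB - quant_size_gb):.6f}")  # ~0.084201
```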
#### Reference Model Cards

- The model card of [TheBloke/Llama-2-13B-GGML](https://huggingface.co/TheBloke/Llama-2-13B-GGML), where Llama 2 has been converted to [GGML](https://github.com/ggerganov/ggml).
- `llama.cpp` pull request [#1684](https://github.com/ggerganov/llama.cpp/pull/1684) for quantized weight performance.

#### Note

Simply download the single GGML weight file; the other files are provided for reference purposes only during development. After conducting several experiments, we will provide the final GGML weight file separately.