---
language:
- en
pipeline_tag: text-generation
tags:
- Microsoft
- Phi3
- Pytorch
---

# SandLogic Technologies - Quantized Phi-3.1-mini-4k-instruct Models

## Model Description

We have quantized the Phi-3.1-mini-4k-instruct model into three variants:

1. Q5_KM
2. Q4_KM
3. IQ4_XS

These quantized models offer improved efficiency while maintaining performance close to the original. Discover our full range of quantized language models by visiting our [SandLogic Lexicon](https://github.com/sandlogic/SandLogic-Lexicon) GitHub repository. To learn more about our company and services, visit our website at [SandLogic](https://www.sandlogic.com).

## Original Model Information

- **Name**: [Phi-3.1-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
- **Developer**: Microsoft
- **Model Type**: Open-source language model
- **Parameters**: 3.8 billion
- **Context Length**: 4K (4,096) tokens
- **Training Data**: 3.3 trillion tokens, including curated public documents, synthetic "textbook-like" data, and high-quality chat data
- **Language**: English

## Model Capabilities

The Phi-3.1-mini-4k-instruct model is designed for a variety of commercial and research applications, particularly in environments with limited memory or computational resources, scenarios requiring low latency, and tasks that demand robust reasoning, such as mathematics and logic. The model's key capabilities include:

1. Instruction following
2. Structured output generation
3. High-quality multi-turn conversations
4. Explicit support for the `<|system|>` tag
5. Improved reasoning

## Use Cases

1. **Environments with Limited Resources**: Suitable for deployment on devices with limited memory or computational power, such as laptops, desktops, or edge devices.
2. **Low-Latency Applications**: Ideal for use cases where quick responses are critical, such as customer service chatbots or real-time text generation.
3. **Mathematics and Logic-Based Tasks**: Performs well on tasks requiring robust reasoning, including math problem-solving and logical inference.
4. **Processing and Analyzing Long-Form Text**: Able to handle and analyze large amounts of text efficiently.

## Model Variants

We offer three quantized versions of the Phi-3.1-mini-4k-instruct model:

1. **Q5_KM**: 5-bit quantization using the K_M (k-quant medium) method
2. **Q4_KM**: 4-bit quantization using the K_M method
3. **IQ4_XS**: 4-bit quantization using the IQ4_XS method

These quantized models aim to reduce model size and improve inference speed while keeping output quality as close to the original model as possible.

## Input and Output

- **Input**: Text string (e.g., instructions, prompts, or long-form text)
- **Output**: Generated text following the input, with structured output, improved reasoning, and adherence to the `<|system|>` tag

## Usage

```bash
pip install llama-cpp-python
```

Please refer to the llama-cpp-python [documentation](https://llama-cpp-python.readthedocs.io/en/latest/) to install with GPU support.

### Basic Text Completion

Here's an example demonstrating how to use the high-level API for basic text completion:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./Phi-3-mini-4k-instruct-q4.gguf",  # Path to the GGUF file
    n_ctx=4096,  # Maximum sequence length; longer contexts require more resources
    n_threads=8,  # Number of CPU threads; tune to your system for best performance
    n_gpu_layers=35,  # Number of layers to offload to GPU if acceleration is available; set to 0 to run entirely on CPU
)

prompt = "How to explain Internet to a medieval knight?"

# Simple inference example
output = llm(
    f"<|user|>\n{prompt}<|end|>\n<|assistant|>",
    max_tokens=256,  # Generate up to 256 tokens
    stop=["<|end|>"],  # Stop at the end-of-turn tag
    echo=True,  # Whether to echo the prompt in the output
)

print(output["choices"][0]["text"])
```
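
### Chat Completion

For multi-turn conversations, you can also use llama-cpp-python's chat-style API, which formats messages with the model's chat template so the `<|system|>`, `<|user|>`, and `<|assistant|>` tags are applied for you. This is a minimal sketch assuming the same local GGUF file as above; adjust the path and parameters to your setup:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./Phi-3-mini-4k-instruct-q4.gguf",  # Assumed local path; point this at your downloaded variant
    n_ctx=4096,
)

# create_chat_completion applies the model's chat template,
# including the <|system|> tag for the system message.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a concise, helpful assistant."},
        {"role": "user", "content": "Explain model quantization in one paragraph."},
    ],
    max_tokens=256,
)

print(response["choices"][0]["message"]["content"])
```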
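
### Streaming Completion

Since low-latency applications are a primary use case for these quantized variants, you may want to stream tokens as they are generated rather than waiting for the full completion. Below is a short sketch using the high-level API's `stream=True` option, again assuming a local GGUF file:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./Phi-3-mini-4k-instruct-q4.gguf",  # Assumed local path
    n_ctx=4096,
)

# With stream=True the call returns an iterator of partial completions,
# so tokens can be printed as soon as they are produced.
for chunk in llm(
    "<|user|>\nWrite a haiku about edge devices.<|end|>\n<|assistant|>",
    max_tokens=128,
    stop=["<|end|>"],
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()
```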
## Download

You can download `Llama` models in `gguf` format directly from Hugging Face using the `from_pretrained` method. This feature requires the `huggingface-hub` package. To install it, run:

```bash
pip install huggingface-hub
```

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="SandLogicTechnologies/Phi-3.1-mini-4k-instruct-GGUF",
    filename="*Phi-3.1-mini-4k-instruct-Q5_K_M.gguf",
    verbose=False
)
```

By default, `from_pretrained` downloads the model to the Hugging Face cache directory. You can inspect and manage downloaded model files with the `huggingface-cli` tool (for example, `huggingface-cli scan-cache`).

## License

Phi-3-mini-4k-instruct is released under the [Phi-3 license](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/LICENSE).

## Acknowledgements

We thank the Microsoft team for developing and releasing the original Phi-3.1-mini-4k-instruct model. Special thanks to Georgi Gerganov and the entire llama.cpp development team for their outstanding contributions.

## Contact

For any inquiries or support, please contact us at [support@sandlogic.com](mailto:support@sandlogic.com) or visit our [support page](https://www.sandlogic.com/LingoForge/support).