---
language:
- en
pipeline_tag: text-generation
tags:
- Microsoft
- Phi3
- Pytorch
---

# SandLogic Technologies - Quantized Phi-3.1-mini-4k-instruct Models

## Model Description

We have quantized the Phi-3.1-mini-4k-instruct model into three variants:

1. Q5_KM
2. Q4_KM
3. IQ4_XS

These quantized models offer improved efficiency while maintaining performance close to the original. Discover our full range of quantized language models by visiting our [SandLogic Lexicon](https://github.com/sandlogic/SandLogic-Lexicon) GitHub repository. To learn more about our company and services, visit our website at [SandLogic](https://www.sandlogic.com).

## Original Model Information

- **Name**: [Phi-3.1-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
- **Developer**: Microsoft
- **Model Type**: Open-source language model
- **Parameters**: 3.8 billion
- **Context Length**: 4K (4,096) tokens
- **Training Data**: 3.3 trillion tokens, including curated public documents, synthetic "textbook-like" data, and high-quality chat data
- **Language**: English

## Model Capabilities

The Phi-3.1-mini-4k-instruct model is designed for a variety of commercial and research applications, particularly in environments with limited memory or computational resources, scenarios requiring low latency, and tasks that demand robust reasoning, such as mathematics and logic. The model's key capabilities include:

1. Instruction following
2. Structured output generation
3. High-quality multi-turn conversations
4. Explicit support for the `<|system|>` tag
5. Improved reasoning

## Use Cases

1. **Environments with Limited Resources**: Suitable for deployment on devices with limited memory or computational power, such as laptops, desktops, or edge devices.
2. **Low-Latency Applications**: Ideal for use cases where quick responses are critical, such as customer service chatbots or real-time text generation.
3. **Mathematics and Logic-Based Tasks**: Performs well on tasks requiring robust reasoning, including math problem-solving and logical inference.
4. **Processing and Analyzing Long-Form Text**: Able to handle and analyze large amounts of text efficiently.

## Model Variants

We offer three quantized versions of the Phi-3.1-mini-4k-instruct model:

1. **Q5_KM**: 5-bit quantization using the K_M (k-quant medium) method
2. **Q4_KM**: 4-bit quantization using the K_M method
3. **IQ4_XS**: 4-bit quantization using the IQ4_XS method

These quantized models aim to reduce model size and improve inference speed while keeping output quality as close to the original model as possible.

## Input and Output

- **Input**: Text string (e.g., instructions, prompts, or long-form text)
- **Output**: Generated text following the input, with structured output, improved reasoning, and adherence to the `<|system|>` tag

## Usage

```bash
pip install llama-cpp-python
```

Please refer to the llama-cpp-python [documentation](https://llama-cpp-python.readthedocs.io/en/latest/) to install with GPU support.

### Basic Text Completion

Here's an example demonstrating how to use the high-level API for basic text completion:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./Phi-3-mini-4k-instruct-q4.gguf",  # Path to the GGUF file
    n_ctx=4096,  # Maximum sequence length; longer contexts require more resources
    n_threads=8,  # Number of CPU threads; tune to your system for best performance
    n_gpu_layers=35,  # Number of layers to offload to GPU if acceleration is available; set to 0 to run entirely on CPU
)

prompt = "How to explain Internet to a medieval knight?"

# Simple inference example
output = llm(
    f"<|user|>\n{prompt}<|end|>\n<|assistant|>",
    max_tokens=256,  # Generate up to 256 tokens
    stop=["<|end|>"],  # Stop at the end-of-turn tag
    echo=True,  # Whether to echo the prompt in the output
)

print(output["choices"][0]["text"])
```
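
### Chat Completion

For multi-turn conversations, you can also use llama-cpp-python's chat-style API, which formats messages with the model's chat template so the `<|system|>`, `<|user|>`, and `<|assistant|>` tags are applied for you. This is a minimal sketch assuming the same local GGUF file as above; adjust the path and parameters to your setup:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./Phi-3-mini-4k-instruct-q4.gguf",  # Assumed local path; point this at your downloaded variant
    n_ctx=4096,
)

# create_chat_completion applies the model's chat template,
# including the <|system|> tag for the system message.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a concise, helpful assistant."},
        {"role": "user", "content": "Explain model quantization in one paragraph."},
    ],
    max_tokens=256,
)

print(response["choices"][0]["message"]["content"])
```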
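
### Streaming Completion

Since low-latency applications are a primary use case for these quantized variants, you may want to stream tokens as they are generated rather than waiting for the full completion. Below is a short sketch using the high-level API's `stream=True` option, again assuming a local GGUF file:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./Phi-3-mini-4k-instruct-q4.gguf",  # Assumed local path
    n_ctx=4096,
)

# With stream=True the call returns an iterator of partial completions,
# so tokens can be printed as soon as they are produced.
for chunk in llm(
    "<|user|>\nWrite a haiku about edge devices.<|end|>\n<|assistant|>",
    max_tokens=128,
    stop=["<|end|>"],
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()
```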
## Download

You can download `Llama` models in `gguf` format directly from Hugging Face using the `from_pretrained` method. This feature requires the `huggingface-hub` package. To install it, run:

```bash
pip install huggingface-hub
```

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="SandLogicTechnologies/Phi-3.1-mini-4k-instruct-GGUF",
    filename="*Phi-3.1-mini-4k-instruct-Q5_K_M.gguf",
    verbose=False
)
```

By default, `from_pretrained` downloads the model to the Hugging Face cache directory. You can inspect and manage downloaded model files with the `huggingface-cli` tool (for example, `huggingface-cli scan-cache`).

## License

Phi-3-mini-4k-instruct is released under the [Phi-3 license](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/LICENSE).

## Acknowledgements

We thank the Microsoft team for developing and releasing the original Phi-3.1-mini-4k-instruct model. Special thanks to Georgi Gerganov and the entire llama.cpp development team for their outstanding contributions.

## Contact

For any inquiries or support, please contact us at [support@sandlogic.com](mailto:support@sandlogic.com) or visit our [support page](https://www.sandlogic.com/LingoForge/support).