---
license: other
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- transformers
- gguf
- imatrix
- Nemotron-Mini-4B-Instruct
---

Quantizations of https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct

### Inference Clients/UIs
* [llama.cpp](https://github.com/ggerganov/llama.cpp)
* [KoboldCPP](https://github.com/LostRuins/koboldcpp)
* [ollama](https://github.com/ollama/ollama)
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
* [GPT4All](https://github.com/nomic-ai/gpt4all)
* [jan](https://github.com/janhq/jan)

---

# From original readme

## Model Overview

Nemotron-Mini-4B-Instruct is a model for generating responses for roleplaying, retrieval-augmented generation, and function calling. It is a small language model (SLM) optimized through distillation, pruning, and quantization for speed and on-device deployment. It is a fine-tuned version of [nvidia/Minitron-4B-Base](https://huggingface.co/nvidia/Minitron-4B-Base), which was pruned and distilled from [Nemotron-4 15B](https://arxiv.org/abs/2402.16819) using [our LLM compression technique](https://arxiv.org/abs/2407.14679). This instruct model is optimized for roleplay, RAG QA, and function calling in English. It supports a context length of 4,096 tokens. This model is ready for commercial use.

Try this model on [build.nvidia.com](https://build.nvidia.com/nvidia/nemotron-mini-4b-instruct).

For more details about how this model is used for [NVIDIA ACE](https://developer.nvidia.com/ace), please refer to [this blog post](https://developer.nvidia.com/blog/deploy-the-first-on-device-small-language-model-for-improved-game-character-roleplay/) and [this demo video](https://www.youtube.com/watch?v=d5z7oIXhVqg), which showcases how the model can be integrated into a video game. You can download the model checkpoint for the NVIDIA AI Inference Manager (AIM) SDK from [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ucs-ms/resources/nemotron-mini-4b-instruct).

**Model Developer:** NVIDIA

**Model Dates:** Nemotron-Mini-4B-Instruct was trained between February 2024 and August 2024.

## License

[NVIDIA Community Model License](https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct/blob/main/nvidia-community-model-license-aug2024.pdf)

## Model Architecture

Nemotron-Mini-4B-Instruct uses a model embedding size of 3072, 32 attention heads, and an MLP intermediate dimension of 9216. It also uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).

**Architecture Type:** Transformer Decoder (auto-regressive language model)

**Network Architecture:** Nemotron-4

## Prompt Format

We recommend using the following prompt template, which was used to fine-tune the model. The model may not perform optimally without it.

**Single Turn**

```
<extra_id_0>System
{system prompt}

<extra_id_1>User
{prompt}
<extra_id_1>Assistant\n
```

**Tool use**

```
<extra_id_0>System
{system prompt}

<tool> ... </tool>
<context> ... </context>

<extra_id_1>User
{prompt}
<extra_id_1>Assistant
<toolcall> ... </toolcall>
<extra_id_1>Tool
{tool response}
<extra_id_1>Assistant\n
```
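Since this repository hosts GGUF quantizations, the template above is what you would feed a local runtime directly. Below is a minimal sketch using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python); the quant filename and the stop string are assumptions, so substitute the file you actually downloaded.

```
from llama_cpp import Llama

# The filename below is a hypothetical example -- use the GGUF quant
# you downloaded from this repository.
llm = Llama(
    model_path="Nemotron-Mini-4B-Instruct.Q4_K_M.gguf",
    n_ctx=4096,  # the model supports a 4,096-token context
)

# Build the single-turn prompt exactly as described above.
prompt = (
    "<extra_id_0>System\n"
    "You are a helpful assistant.\n"
    "\n"
    "<extra_id_1>User\n"
    "Write one sentence about GPUs.\n"
    "<extra_id_1>Assistant\n"
)

# Stopping on the next turn marker (an assumption) ends generation
# after the assistant's reply.
out = llm(prompt, max_tokens=128, stop=["<extra_id_1>"])
print(out["choices"][0]["text"])
```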
## Usage

```
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

# Use the prompt template
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```

You can also use `pipeline`, but you need to create a tokenizer object and assign it to the pipeline manually.

```
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

messages = [
    {"role": "user", "content": "Who are you?"},
]

pipe = pipeline("text-generation", model="nvidia/Nemotron-Mini-4B-Instruct")
pipe.tokenizer = tokenizer  # You need to assign the tokenizer manually
pipe(messages)
```
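To check that the tokenizer's chat template reproduces the prompt format described above, you can render the messages to a plain string instead of token IDs. A small sketch:

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

# tokenize=False returns the formatted prompt text, which can be
# inspected or reused with other runtimes such as llama.cpp.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```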