---
license: other
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- transformers
- gguf
- imatrix
- Nemotron-Mini-4B-Instruct
---
Quantizations of https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct

### Inference Clients/UIs
* [llama.cpp](https://github.com/ggerganov/llama.cpp)
* [KoboldCPP](https://github.com/LostRuins/koboldcpp)
* [ollama](https://github.com/ollama/ollama)
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
* [GPT4All](https://github.com/nomic-ai/gpt4all)
* [jan](https://github.com/janhq/jan)
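
For a quick local test, the GGUF files in this repo can also be loaded from Python through the llama.cpp bindings (`llama-cpp-python`). A minimal sketch, assuming a hypothetical quant filename `Nemotron-Mini-4B-Instruct.Q4_K_M.gguf` downloaded from this repo:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Filename is an assumption; substitute whichever quant you downloaded.
llm = Llama(
    model_path="Nemotron-Mini-4B-Instruct.Q4_K_M.gguf",
    n_ctx=4096,  # the model supports up to 4,096 tokens of context
)

# llama-cpp-python will try to use the chat template stored in the GGUF metadata.
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Who are you?"}],
    max_tokens=128,
)
print(result["choices"][0]["message"]["content"])
```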

---

# From original readme

## Model Overview

Nemotron-Mini-4B-Instruct is a model for generating responses for roleplaying, retrieval-augmented generation, and function calling. It is a small language model (SLM) optimized through distillation, pruning, and quantization for speed and on-device deployment. It is a fine-tuned version of [nvidia/Minitron-4B-Base](https://huggingface.co/nvidia/Minitron-4B-Base), which was pruned and distilled from [Nemotron-4 15B](https://arxiv.org/abs/2402.16819) using [our LLM compression technique](https://arxiv.org/abs/2407.14679). This instruct model is optimized for roleplay, RAG QA, and function calling in English. It supports a context length of 4,096 tokens. This model is ready for commercial use.

Try this model on [build.nvidia.com](https://build.nvidia.com/nvidia/nemotron-mini-4b-instruct).

For more details about how this model is used for [NVIDIA ACE](https://developer.nvidia.com/ace), please refer to [this blog post](https://developer.nvidia.com/blog/deploy-the-first-on-device-small-language-model-for-improved-game-character-roleplay/) and [this demo video](https://www.youtube.com/watch?v=d5z7oIXhVqg), which showcase how the model can be integrated into a video game. You can download the model checkpoint for the NVIDIA AI Inference Manager (AIM) SDK from [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ucs-ms/resources/nemotron-mini-4b-instruct).

**Model Developer:** NVIDIA

**Model Dates:** Nemotron-Mini-4B-Instruct was trained between February 2024 and August 2024.

## License

[NVIDIA Community Model License](https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct/blob/main/nvidia-community-model-license-aug2024.pdf)

## Model Architecture

Nemotron-Mini-4B-Instruct uses a model embedding size of 3072, 32 attention heads, and an MLP intermediate dimension of 9216. It also uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).

**Architecture Type:** Transformer Decoder (auto-regressive language model)

**Network Architecture:** Nemotron-4
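
These hyperparameters can be checked against the checkpoint itself without downloading the weights. A minimal sketch, assuming the standard `transformers` config attribute names for this architecture:

```python
from transformers import AutoConfig

# Fetch only the config (no weights) and compare with the numbers above.
config = AutoConfig.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
print(config.hidden_size)          # embedding size: expected 3072
print(config.num_attention_heads)  # expected 32
print(config.intermediate_size)    # MLP intermediate dimension: expected 9216
print(config.num_key_value_heads)  # fewer KV heads than attention heads => GQA
```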

## Prompt Format

We recommend using the following prompt template, which was used to fine-tune the model. The model may not perform optimally without it.

**Single Turn**

```
<extra_id_0>System
{system prompt}

<extra_id_1>User
{prompt}
<extra_id_1>Assistant\n
```
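
For completeness, here is how the single-turn template maps onto plain string formatting. This is a sketch for illustration: the `build_single_turn` helper is hypothetical, and whether it matches the tokenizer's bundled chat template byte-for-byte is an assumption, so the snippet prints both strings for manual comparison rather than asserting equality.

```python
from transformers import AutoTokenizer

def build_single_turn(system_prompt: str, user_prompt: str) -> str:
    # Hypothetical helper mirroring the single-turn template above; the
    # trailing newline after "Assistant" cues the model to start its reply.
    return (
        f"<extra_id_0>System\n{system_prompt}\n\n"
        f"<extra_id_1>User\n{user_prompt}\n"
        "<extra_id_1>Assistant\n"
    )

manual = build_single_turn("You are a helpful assistant.", "Who are you?")

# Render the same conversation through the bundled chat template.
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
templated = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
print(repr(manual))
print(repr(templated))  # compare by eye; whitespace details may differ
```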

**Tool Use**

```
<extra_id_0>System
{system prompt}

<tool> ... </tool>
<context> ... </context>

<extra_id_1>User
{prompt}
<extra_id_1>Assistant
<toolcall> ... </toolcall>
<extra_id_1>Tool
{tool response}
<extra_id_1>Assistant\n
```
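
Read bottom-up, the template spells out the tool-calling loop: in the first `<extra_id_1>Assistant` turn the model emits a `<toolcall> ... </toolcall>` block instead of a plain answer; your application executes the call, appends the result under `<extra_id_1>Tool`, and the closing `<extra_id_1>Assistant\n` cues the model to compose its final answer from the tool response. The `<tool>` and `<context>` blocks in the system turn carry the tool definitions and any retrieved context, respectively.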

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

# Build the prompt with the chat template shown above
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

outputs = model.generate(tokenized_chat, max_new_tokens=128)
# This decodes the prompt plus the reply; slice outputs[0][tokenized_chat.shape[-1]:]
# to keep only the newly generated tokens.
print(tokenizer.decode(outputs[0]))
```

You can also use `pipeline`, but you need to create a tokenizer object and assign it to the pipeline manually.

```python
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="nvidia/Nemotron-Mini-4B-Instruct")
pipe.tokenizer = tokenizer  # assign the tokenizer manually
pipe(messages)
```
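
Note that `pipe(messages)` returns the generation results as a list rather than printing them, so wrap the call in `print(...)` when running this as a script; the returned items carry the output under a `generated_text` key, though its exact structure varies across `transformers` versions.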