---
license: other
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- transformers
- gguf
- imatrix
- Nemotron-Mini-4B-Instruct
---
Quantizations of https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct

### Inference Clients/UIs
* [llama.cpp](https://github.com/ggerganov/llama.cpp)
* [KoboldCPP](https://github.com/LostRuins/koboldcpp)
* [ollama](https://github.com/ollama/ollama)
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
* [GPT4All](https://github.com/nomic-ai/gpt4all)
* [jan](https://github.com/janhq/jan)
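
The GGUF files can also be loaded directly from Python via the llama-cpp-python bindings. A minimal sketch, assuming llama-cpp-python is installed and a quantization has been downloaded locally (the file name below is a placeholder for whichever quant you chose):

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# The GGUF file name is a placeholder; point it at the quant you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="Nemotron-Mini-4B-Instruct-Q4_K_M.gguf",  # hypothetical local file
    n_ctx=4096,  # the model supports a 4,096-token context
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Who are you?"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

`create_chat_completion` relies on the chat template stored in the GGUF metadata; if it is not picked up, format the prompt manually using the template in the Prompt Format section below.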

---

# From original readme

## Model Overview

Nemotron-Mini-4B-Instruct is a model for generating responses for roleplaying, retrieval-augmented generation, and function calling. It is a small language model (SLM) optimized through distillation, pruning, and quantization for speed and on-device deployment. It is a fine-tuned version of [nvidia/Minitron-4B-Base](https://huggingface.co/nvidia/Minitron-4B-Base), which was pruned and distilled from [Nemotron-4 15B](https://arxiv.org/abs/2402.16819) using [our LLM compression technique](https://arxiv.org/abs/2407.14679). This instruct model is optimized for roleplay, RAG QA, and function calling in English. It supports a context length of 4,096 tokens. This model is ready for commercial use.

Try this model on [build.nvidia.com](https://build.nvidia.com/nvidia/nemotron-mini-4b-instruct).

For more details about how this model is used for [NVIDIA ACE](https://developer.nvidia.com/ace), please refer to [this blog post](https://developer.nvidia.com/blog/deploy-the-first-on-device-small-language-model-for-improved-game-character-roleplay/) and [this demo video](https://www.youtube.com/watch?v=d5z7oIXhVqg), which showcases how the model can be integrated into a video game. You can download the model checkpoint for the NVIDIA AI Inference Manager (AIM) SDK from [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ucs-ms/resources/nemotron-mini-4b-instruct).

**Model Developer:** NVIDIA

**Model Dates:** Nemotron-Mini-4B-Instruct was trained between February 2024 and August 2024.

## License

[NVIDIA Community Model License](https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct/blob/main/nvidia-community-model-license-aug2024.pdf)

## Model Architecture

Nemotron-Mini-4B-Instruct uses a model embedding size of 3072, 32 attention heads, and an MLP intermediate dimension of 9216. It also uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).

**Architecture Type:** Transformer Decoder (auto-regressive language model)

**Network Architecture:** Nemotron-4
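
These hyperparameters can be checked programmatically from the hub config. A minimal sketch, assuming the checkpoint's config exposes the standard transformers attribute names:

```python
# Sketch: inspect architecture hyperparameters from the hub config.
# Attribute names are an assumption based on standard transformers conventions.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
print(config.hidden_size)          # expected: 3072 (embedding size)
print(config.num_attention_heads)  # expected: 32
print(config.intermediate_size)    # expected: 9216 (MLP dimension)
print(config.num_key_value_heads)  # fewer KV heads than attention heads implies GQA
```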

## Prompt Format

We recommend using the following prompt template, which was used to fine-tune the model. The model may not perform optimally without it.

**Single Turn**

```
<extra_id_0>System
{system prompt}

<extra_id_1>User
{prompt}
<extra_id_1>Assistant\n
```

**Tool use**

```
<extra_id_0>System
{system prompt}

<tool> ... </tool>
<context> ... </context>

<extra_id_1>User
{prompt}
<extra_id_1>Assistant
<toolcall> ... </toolcall>
<extra_id_1>Tool
{tool response}
<extra_id_1>Assistant\n
```
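
The chat template bundled with the original checkpoint should produce this format automatically. As a quick check, you can render the template as plain text instead of token IDs; a minimal sketch (the exact rendered string depends on the template shipped with the tokenizer):

```python
# Sketch: render the chat template as text to confirm it matches the
# single-turn format above (exact output depends on the bundled template).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
messages = [
    {"role": "system", "content": "{system prompt}"},
    {"role": "user", "content": "{prompt}"},
]
prompt_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt_text)  # should contain the <extra_id_0>/<extra_id_1> markers
```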

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

# Use the prompt template
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
tokenized_chat = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)

outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```

You can also use `pipeline`, but you need to create a tokenizer object and assign it to the pipeline manually.

```python
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="nvidia/Nemotron-Mini-4B-Instruct")
pipe.tokenizer = tokenizer  # assign the tokenizer manually
pipe(messages)
```