Update README.md
README.md
CHANGED
@@ -13,182 +13,21 @@ base_model: meta-llama/Llama-3.1-70B-Instruct
pipeline_tag: text-generation
library_name: transformers
---

# Model Overview

As of 1 Oct 2024, this model is #1 on all three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.

This model was trained using RLHF (specifically, REINFORCE), [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward) and [HelpSteer2-Preference prompts](https://huggingface.co/datasets/nvidia/HelpSteer2) on a Llama-3.1-70B-Instruct model as the initial policy.

Llama-3.1-Nemotron-70B-Instruct-HF has been converted from [Llama-3.1-Nemotron-70B-Instruct](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct) to support it in the HuggingFace Transformers codebase. Please note that evaluation results might differ slightly from those of [Llama-3.1-Nemotron-70B-Instruct](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct) as evaluated in NeMo-Aligner, which is what the evaluation results below are based on.

Try hosted inference for free at [build.nvidia.com](https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-instruct) - it comes with an OpenAI-compatible API interface.
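
For reference, the hosted endpoint can be called with any OpenAI-compatible client. The sketch below uses the `openai` Python package; the base URL, model ID, and API-key handling are assumptions based on NVIDIA's hosted API conventions, so confirm the exact values on build.nvidia.com.

```python
# Minimal sketch of calling the hosted endpoint via its OpenAI-compatible API.
# NOTE: base_url and model ID are assumptions - confirm both on build.nvidia.com.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed NVIDIA API gateway
    api_key="YOUR_NVIDIA_API_KEY",
)

completion = client.chat.completions.create(
    model="nvidia/llama-3.1-nemotron-70b-instruct",  # assumed hosted model ID
    messages=[{"role": "user", "content": "How many r in strawberry?"}],
    temperature=0.0,
    max_tokens=256,
)
print(completion.choices[0].message.content)
```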

See details on our paper at [https://arxiv.org/abs/2410.01257](https://arxiv.org/abs/2410.01257) - as a preview, this model can correctly answer the question ```How many r in strawberry?``` without specialized prompting or additional reasoning tokens:

```
A sweet question!
Let’s count the “R”s in “strawberry”:
1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. R
10. Y
There are **3 “R”s** in the word “strawberry”.
```

Note: This model is a demonstration of our techniques for improving helpfulness in general-domain instruction following. It has not been tuned for performance in specialized domains such as math.

## Terms of use

By accessing this model, you are agreeing to the Llama 3.1 terms and conditions of the [license](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE), [acceptable use policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/USE_POLICY.md) and [Meta’s privacy policy](https://www.facebook.com/privacy/policy/).

## Evaluation Metrics

As of 1 Oct 2024, Llama-3.1-Nemotron-70B-Instruct performs best on Arena Hard, AlpacaEval 2 LC (verified tab) and MT Bench (GPT-4-Turbo).

| Model | Arena Hard (95% CI) | AlpacaEval 2 LC (SE) | MT-Bench (GPT-4-Turbo) | Mean Response Length (# of Characters for MT-Bench) |
|:------|:--------------------|:---------------------|:-----------------------|:-----------------------------------------------------|
| _**Llama-3.1-Nemotron-70B-Instruct**_ | **85.0** (-1.5, 1.5) | **57.6** (1.65) | **8.98** | 2199.8 |
| Llama-3.1-70B-Instruct | 55.7 (-2.9, 2.7) | 38.1 (0.90) | 8.22 | 1728.6 |
| Llama-3.1-405B-Instruct | 69.3 (-2.4, 2.2) | 39.3 (1.43) | 8.49 | 1664.7 |
| Claude-3-5-Sonnet-20240620 | 79.2 (-1.9, 1.7) | 52.4 (1.47) | 8.81 | 1619.9 |
| GPT-4o-2024-05-13 | 79.3 (-2.1, 2.0) | 57.5 (1.47) | 8.74 | 1752.2 |

## Usage:

You can use the model with the HuggingFace Transformers library on 2 or more 80GB GPUs (NVIDIA Ampere or newer), with at least 150GB of free disk space to accommodate the download.

This code has been tested on Transformers v4.44.0, torch v2.4.0 and 2 A100 80GB GPUs, but any setup that supports ```meta-llama/Llama-3.1-70B-Instruct``` should support this model as well. If you run into problems, you can consider doing ```pip install -U transformers```.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"

# Load the model in bfloat16 and shard it across all available GPUs
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r in strawberry?"
messages = [{"role": "user", "content": prompt}]

# Apply the chat template and move the inputs to the GPU
tokenized_message = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True)
response_token_ids = model.generate(tokenized_message['input_ids'].cuda(), attention_mask=tokenized_message['attention_mask'].cuda(), max_new_tokens=4096, pad_token_id=tokenizer.eos_token_id)

# Strip the prompt tokens and decode only the newly generated response
generated_tokens = response_token_ids[:, len(tokenized_message['input_ids'][0]):]
generated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(generated_text)

# See response at top of model card
```
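
Alternatively, the same chat flow can be run through the higher-level `pipeline` API, which applies the chat template automatically. This is a sketch rather than an officially documented snippet; it assumes a recent Transformers version (v4.44.0 or later, as tested above).

```python
import torch
from transformers import pipeline

# Build a text-generation pipeline that shards the model across available GPUs
generator = pipeline(
    "text-generation",
    model="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "How many r in strawberry?"}]

# The pipeline applies the chat template and returns the full conversation,
# with the assistant's reply appended as the last message.
outputs = generator(messages, max_new_tokens=4096)
print(outputs[0]["generated_text"][-1]["content"])
```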

## Contact

E-Mail: [Zhilin Wang](mailto:zhilinw@nvidia.com)

## Citation

If you find this model useful, please cite the following works

```bibtex
@misc{wang2024helpsteer2preferencecomplementingratingspreferences,
      title={HelpSteer2-Preference: Complementing Ratings with Preferences},
      author={Zhilin Wang and Alexander Bukharin and Olivier Delalleau and Daniel Egert and Gerald Shen and Jiaqi Zeng and Oleksii Kuchaiev and Yi Dong},
      year={2024},
      eprint={2410.01257},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.01257},
}

@misc{wang2024helpsteer2,
      title={HelpSteer2: Open-source dataset for training top-performing reward models},
      author={Zhilin Wang and Yi Dong and Olivier Delalleau and Jiaqi Zeng and Gerald Shen and Daniel Egert and Jimmy J. Zhang and Makesh Narsimhan Sreedhar and Oleksii Kuchaiev},
      year={2024},
      eprint={2406.08673},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
}
```

## References:

* [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257)
* [SteerLM method](https://arxiv.org/abs/2310.05344)
* [HelpSteer](https://arxiv.org/abs/2311.09528)
* [HelpSteer2](https://arxiv.org/abs/2406.08673)
* [Introducing Llama 3.1: Our most capable models to date](https://ai.meta.com/blog/meta-llama-3-1/)
* [Meta's Llama 3.1 Webpage](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1)
* [Meta's Llama 3.1 Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md)

## Model Architecture:
**Architecture Type:** Transformer <br>
**Network Architecture:** Llama 3.1 <br>

## Input:
**Input Type(s):** Text <br>
**Input Format:** String <br>
**Input Parameters:** One Dimensional (1D) <br>
**Other Properties Related to Input:** Max of 128k tokens <br>

## Output:
**Output Type(s):** Text <br>
**Output Format:** String <br>
**Output Parameters:** One Dimensional (1D) <br>
**Other Properties Related to Output:** Max of 4k tokens <br>

## Software Integration:
**Supported Hardware Microarchitecture Compatibility:** <br>
* NVIDIA Ampere <br>
* NVIDIA Hopper <br>
* NVIDIA Turing <br>

**Supported Operating System(s):** Linux <br>

## Model Version:
v1.0

# Training & Evaluation:

## Datasets:

**Data Collection Method by dataset** <br>
* [Hybrid: Human, Synthetic] <br>

**Labeling Method by dataset** <br>
* [Human] <br>

**Link:**
* [HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2)

**Properties (Quantity, Dataset Descriptions, Sensor(s)):** <br>
* 21,362 prompt-responses built to make models more aligned with human preference - specifically more helpful, factually correct, coherent, and customizable based on complexity and verbosity.
* 20,324 prompt-responses used for training and 1,038 used for validation.

# Inference:
**Engine:** [Triton](https://developer.nvidia.com/triton-inference-server) <br>
**Test Hardware:** H100, A100 80GB, A100 40GB <br>

---

Quantized model => https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF

**Quantization Details:**

Quantization is done using turboderp's ExLlamaV2 v0.2.2.

I use the default calibration datasets and arguments. The repo also includes a "measurement.json" file, which was used during the quantization process.

For models with bits per weight (BPW) over 6.0, I default to quantizing the `lm_head` layer at 8 bits instead of the standard 6 bits.
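
As a rough illustration of the process (not an exact record of the commands used), a quant like this can be reproduced with ExLlamaV2's `convert.py`, reusing the bundled `measurement.json`. The paths, target BPW, and flag spellings below are assumptions; check `python convert.py -h` in your ExLlamaV2 checkout.

```python
# Hypothetical sketch: invoking ExLlamaV2's convert.py to rebuild an EXL2 quant.
# Flag names are assumed from ExLlamaV2 v0.2.x; paths and BPW are placeholders.
import subprocess

subprocess.run(
    [
        "python", "convert.py",
        "-i", "/models/Llama-3.1-Nemotron-70B-Instruct-HF",     # source FP16 weights
        "-o", "/tmp/exl2-work",                                 # scratch/working directory
        "-cf", "/models/Llama-3.1-Nemotron-70B-Instruct-exl2",  # compiled output directory
        "-m", "measurement.json",                               # reuse the bundled measurement
        "-b", "6.5",                                            # target bits per weight
        "-hb", "8",                                             # 8-bit lm_head for BPW > 6.0
    ],
    check=True,
)
```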
---

**Who are you? What's with these weird BPWs on [insert model here]?**

I specialize in optimized EXL2 quantization for models in the 70B to 100B+ range, specifically tailored for 48GB VRAM setups. My rig is built using 2 x 3090s with a Ryzen APU (APU used solely for desktop output, so no VRAM is wasted on the 3090s). I use TabbyAPI for inference, targeting context sizes between 32K and 64K.

Every model I upload includes a `config.yml` file with my ideal TabbyAPI settings. If you're using my config, don’t forget to set `PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync` to save some VRAM.
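
Once TabbyAPI is serving the quant, it can be queried with any OpenAI-compatible client. The host, port, and API key below are assumptions based on TabbyAPI defaults; match them to your own TabbyAPI configuration.

```python
# Minimal sketch: querying a local TabbyAPI server through its OpenAI-compatible API.
# The base_url and api_key are assumptions - use the values from your TabbyAPI setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:5000/v1",  # assumed default TabbyAPI host/port
    api_key="YOUR_TABBY_API_KEY",
)

response = client.chat.completions.create(
    model="Llama-3.1-Nemotron-70B-Instruct-exl2",  # whatever model name TabbyAPI exposes
    messages=[{"role": "user", "content": "How many r in strawberry?"}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```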