---

language:
- no  # Generic Norwegian
- nb  # Norwegian Bokmål
- nn  # Norwegian Nynorsk
- en  # English
- sv  # Swedish
- da  # Danish
tags:
- norwegian
- bokmål
- nynorsk
- swedish
- danish
- multilingual
- text-generation
pipeline_tag: text-generation
license: llama3.1
---

## Model Card: NB-Llama-3.1-70B-sft

---

### Model Overview
This is the SFT version of the NB-Llama models. The model has gone through supervised fine-tuning and now understands a basic chat template. Note that it has not yet been aligned, so it may behave fairly unpredictably. It is best suited for additional fine-tuning.

**NB-Llama-3.1-70B-sft** is part of the **NB-Llama-3.1** series of models, trained on top of [NB-Llama-3.1-70B](https://huggingface.co/NbAiLab/Llama-3.1-70B). This multilingual generative model was fine-tuned specifically to support Norwegian Bokmål, Norwegian Nynorsk, and English, with partial support for Swedish and Danish.

The basic idea with this model series was to explore how current state-of-the-art models could be improved for Norwegian by training only on publicly available data. While these models are trained by the National Library of Norway, they do not include data only available through legal deposit. They do, however, contain public data like governmental reports that are both publicly available and legally deposited.

---

### Key Features

- **Base Model**: Built on NB-Llama-3.1-70B.
- **Languages**:
  - Full support: Norwegian Bokmål (nb), Norwegian Nynorsk (nn), English (en).
  - Partial support: Swedish (sv), Danish (da).
- **Purpose**: Supports Norwegian-specific tasks such as question-answering, summarization, and language modeling, while being capable of multilingual generation and translation. Efforts have been made to preserve the English capabilities from the underlying Meta Llama model.
- **Training Data**: Combines publicly available multilingual datasets with synthetic data generation, focusing on Norwegian, English, Swedish, and Danish sources. Additional details are provided below.
- **Architecture**: The model uses the Llama 3.1 architecture. It is an auto-regressive language model with an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) for alignment.

---

### Model Details

- **Developer**: National Library of Norway (NB-AiLab).
- **Parameters**: 70 billion.
- **Knowledge Cutoff**: May 2024.
- **License**: [Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3.1/LICENSE).

---

### Motivation

The primary goal of **NB-Llama-3.1-70B-sft** is to advance support for Norwegian language technologies and strengthen support for Norwegian Bokmål and Norwegian Nynorsk. Since much knowledge and culture are also expressed in English, Swedish, and Danish, open sources in these languages are included in the training datasets when possible.

---

### Intended Use

#### Use Cases

- Dialogue systems.
- General multilingual text generation and language modeling.
- Norwegian-specific tasks such as:
  - Summarization of texts in Bokmål or Nynorsk.
  - Question-answering tailored to Norwegian cultural and linguistic contexts.

#### Out-of-Scope

- Use in violation of applicable laws or regulations.
- Tasks outside the supported languages without additional fine-tuning.
- High-risk domains without appropriate safety measures.

---

### How to Use

Please note that this is still a research project, and the purpose of releasing the models is to investigate the potential of adapting these models to the Norwegian language. The intended use case is experimental. For end users, we strongly recommend using the instruction-tuned models. We provide quantized models with close to the same accuracy that will run much faster on most platforms. When fine-tuning the instruction-tuned models, the best results are obtained by applying the appropriate templates from Llama 3.1.

#### Using `transformers`

```python
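import torch
from transformers import pipeline

model_id = "NbAiLab/nb-llama-3.1-70B-sft"

# Load the model in bfloat16 and let the library place it across available devices.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Chat-style input: a list of role/content messages.
messages = [
    {"role": "user", "content": "Hvem er du?"},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
)

# The last entry of the returned conversation is the newly generated reply.
print(outputs[0]["generated_text"][-1])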
```
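
The note above recommends applying the Llama 3.1 templates when fine-tuning. As a rough illustration of what such a template produces, the sketch below renders a message list by hand using the publicly documented Llama 3.1 special tokens. `render_llama31_prompt` is a hypothetical helper written for this example, and it assumes this model keeps the stock Llama 3.1 layout without a default system message. In practice the tokenizer's own `apply_chat_template` method is the authoritative source for the exact string.

```python
# Hypothetical helper for illustration only; not part of this repository.
# Assumes the stock Llama 3.1 prompt layout with no default system message.
def render_llama31_prompt(messages, add_generation_prompt=True):
    """Render role/content messages into a single Llama 3.1 style prompt string."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Hvem er du?"}]
prompt = render_llama31_prompt(messages)
print(prompt)
```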
---

### Training Data

**Overview:**

The training data is based entirely on publicly available datasets and synthetically generated data. A key aspect of the training process was leveraging high-quality knowledge sources in Norwegian, English, Swedish, and Danish.

Parts of the following publicly available datasets were used:

- [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX)
- [High Performance Language Technologies (HPLT)](https://huggingface.co/datasets/HPLT/hplt_monolingual_v1_2)
- [Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NCC/Norwegian-Colossal-Corpus)
- [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)

---

### Data Selection

To ensure the highest quality training data, only a small subset of the original raw data was used. [Corpus Quality Classifiers](https://huggingface.co/collections/NbAiLab/corpus-quality-classifier-673f15926c2774fcc88f23aa) built on [nb-bert-base](https://huggingface.co/NbAiLab/nb-bert-base) were trained to evaluate both educational value and linguistic quality of the training samples. These models are released along with the NB-Llama-3.x models, and are considered the main output from this initiative.


- **Categorization Methods:**
  - Inspired by the [FineWeb](https://example.com/FineWeb) project.
  - Evaluated for:
    - **Educational Value:** Prioritizing high-value training samples.
    - **Linguistic Quality:** Ensuring clarity and accuracy in training data.
- **Guidance and Release:**
  - Categorization was guided by insights from [Gemini 1.5](https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#gemini-15).
  - The classifiers are released alongside this model and are [available here](https://huggingface.co/collections/NbAiLab/corpus-quality-classifier-673f15926c2774fcc88f23aa).
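
The selection step described above can be sketched as a simple threshold filter over classifier scores. This is only an illustrative sketch: the `ScoredSample` type, the `select_samples` function, the score range, and the threshold values are all assumptions made for this example, and the card does not state how the actual scores were combined or which cut-offs were used.

```python
from dataclasses import dataclass


@dataclass
class ScoredSample:
    """A text sample with hypothetical classifier scores in the range [0, 1]."""
    text: str
    educational_value: float
    linguistic_quality: float


def select_samples(samples, min_educational=0.6, min_linguistic=0.6):
    """Keep samples that clear both (assumed) quality thresholds."""
    return [
        sample
        for sample in samples
        if sample.educational_value >= min_educational
        and sample.linguistic_quality >= min_linguistic
    ]


# Made-up scores, purely to show the filter in action.
corpus = [
    ScoredSample("Stortinget vedtok loven i 1814.", 0.9, 0.8),
    ScoredSample("klikk her!!! gratis", 0.1, 0.2),
    ScoredSample("Ein godt skriven tekst utan fagleg innhald.", 0.3, 0.9),
]
selected = select_samples(corpus)
print([sample.text for sample in selected])
```

Requiring both scores to pass keeps only a small, high-quality subset, which mirrors the stated goal of using a small fraction of the raw data.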

---

### Licensing

The model is released under the [Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3.1/LICENSE), allowing for research and commercial use within defined limitations. Refer to the [Acceptable Use Policy](https://llama.meta.com/llama3.1/use-policy) for specific restrictions.

---

### Citing & Authors
The model was trained and the documentation was written by Per Egil Kummervold as part of the NoTraM project.

---

### Funding and Acknowledgement
Training this model was supported by Google’s TPU Research Cloud (TRC), which generously supplied us with Cloud TPUs essential for our computational needs.