language:
  - 'no'
  - nb
  - nn
inference: true
tags:
  - gpt
  - generative
  - bloom
license: cc-by-4.0
pipeline_tag: text-generation
datasets:
  - uonlp/CulturaX
  - NbAiLab/NCC
  - vikp/starcoder_filtered

NorBLOOM-7b-scratch

NorBLOOM-7b-scratch is a large Norwegian language model pretrained from scratch on a total of 260 billion subword tokens (using six repetitions of open Norwegian texts).

This model is part of the NORA-LLM family developed in collaboration between the Language Technology Group at the University of Oslo, the High Performance Language Technologies (HPLT) project, the National Library of Norway, and the University of Turku. All the models are pretrained on the same dataset and with the same tokenizer. NorBLOOM-7b-scratch has around 7 billion parameters and is based on the BLOOM architecture.

The NORA-LLM language model family includes (as of now):

Disclaimer: This model is pretrained on raw (mostly web-based) textual data. It is not finetuned to follow instructions, and it can generate harmful completions in response to inappropriate prompts. It is primarily intended for research purposes.


Pretraining corpus

The model is pretrained exclusively on publicly available data. We combine the resources from the public part of the NCC corpus, from the cleaned HPLT corpus, and from CulturaX. This resulted in over 34B subword tokens of Norwegian (Bokmål or Nynorsk) in total, which amounts to about 26.7B whitespace-separated tokens. We also augment the corpus with Starcoder; 20% of the 260B tokens are sampled from this code corpus. The natural language data is repeated six times to get the pretraining budget of 260B tokens, in accordance with findings from Muennighoff et al. (2023).
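
The token budget above can be checked with a few lines of arithmetic. This is a sketch that uses only the rounded figures quoted in the paragraph; because the Norwegian corpus is stated as "over 34B" tokens, the sum lands slightly below the nominal 260B budget.

```python
# Token budget sketch; all figures are the rounded values quoted in the text.
TOTAL_BUDGET = 260e9       # total pretraining tokens
CODE_FRACTION = 0.20       # share of the budget sampled from Starcoder
NORWEGIAN_TOKENS = 34e9    # lower bound on unique Norwegian subword tokens
WHITESPACE_TOKENS = 26.7e9 # the same text counted in whitespace-separated tokens
REPETITIONS = 6            # passes over the natural language data

code_tokens = TOTAL_BUDGET * CODE_FRACTION
natural_tokens = NORWEGIAN_TOKENS * REPETITIONS
total = code_tokens + natural_tokens
subwords_per_word = NORWEGIAN_TOKENS / WHITESPACE_TOKENS

print(f"code tokens:       {code_tokens / 1e9:.0f}B")
print(f"natural tokens:    {natural_tokens / 1e9:.0f}B")
print(f"sum:               {total / 1e9:.0f}B")
print(f"subwords per word: {subwords_per_word:.2f}")
```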


Model details

Model Developers: Language Technology Group at the University of Oslo.

Input: Textual input.

Output: Generated text.

Model Architecture: NorBLOOM is an auto-regressive language model that uses the BLOOM architecture.

| Model | Training Data | Params | Context Length | Tokens | LR |
|---|---|---|---|---|---|
| NorMistral-7b-warm | NCC+HPLT+CulturaX+Starcoder | 7B | 2k | 260B | 1.0 × 10⁻⁴ |
| NorMistral-7b-scratch | NCC+HPLT+CulturaX+Starcoder | 7B | 2k | 260B | 3.0 × 10⁻⁴ |
| NorBLOOM-7b-scratch | NCC+HPLT+CulturaX+Starcoder | 7B | 2k | 260B | 1.2 × 10⁻⁴ |

Tokenizer: Byte-based BPE tokenizer trained on the same Norwegian corpus as this model. The vocabulary size is 32,768 tokens.

Training FLOPs: The approximate amount is 1.12e+22 FLOPs, calculated as in Chowdhery et al. (2022).
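
As a rough sanity check on the order of magnitude, the common rule of thumb of about 6 FLOPs per parameter per training token can be applied. This is a simplified sketch, not the calculation used for the figure above: it omits the attention term that Chowdhery et al. (2022) include, and it assumes the rounded values of 7B parameters and 260B tokens, so it only approximates the reported 1.12e+22.

```python
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb estimate: ~6 FLOPs per parameter per token (forward + backward)."""
    return 6.0 * n_params * n_tokens

# Assumed inputs: rounded parameter count and token budget from this card.
estimate = approx_training_flops(7e9, 260e9)
print(f"{estimate:.2e} FLOPs")
```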

Model Dates: This model was pretrained in December 2023.

Status: This is only a pretrained language model; an instruction-finetuned model will follow soon.

License: Creative Commons Attribution 4.0

Research Paper: Forthcoming


Initial evaluation

Disclaimer: Our model evaluation is ongoing and is not claimed to be exhaustive. We provide initial evaluation results on standard natural language understanding and generation tasks, and we will extend the evaluation design. Users should run their own evaluation for their particular application scenario, including safety and bias evaluations.

The perplexity on the held-out validation set from the Norwegian Colossal Corpus (NCC) is 7.43, and the final training perplexity is 4.76.
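
For readers who want to relate these numbers to a training loss, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below assumes the loss is measured in nats per subword token, which the card does not state explicitly; under that assumption the two reported perplexities correspond to mean losses of roughly 2.01 and 1.56.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), with NLL in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical per-token losses, used only to exercise the function.
example_ppl = perplexity([2.1, 1.9, 2.0])

# Mean losses implied by the reported perplexities (assuming nats per token).
validation_loss = math.log(7.43)
training_loss = math.log(4.76)
print(f"validation loss: {validation_loss:.2f}, training loss: {training_loss:.2f}")
```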

Our initial downstream evaluation is conducted on reading comprehension, sentiment analysis and machine translation tasks using open-source peer-reviewed datasets and benchmarks in native Norwegian. We release our codebase here. We compare against other pretrained generative language models that officially support Norwegian: NB-GPT-J, GPT-Sw3 6.7B, GPT-Sw3 6.7B v2, and Falcon-7B.

Reading comprehension

NorQuAD (Ivanova et al., 2023) is a dataset for extractive question answering in Norwegian designed similarly to SQuAD (Rajpurkar et al., 2016).

Method
  • Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
  • Prompt: "Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:{answer}"
  • Few-shot results show the average scores across 5 repetitions.
  • Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/norquad.py
  • Performance metrics: macro-averaged F1-score and exact match (EM).
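
The setup listed above can be illustrated with a small sketch. It is not the linked evaluation script: the separator between demonstrations, the helper names, and the simplified answer normalization (lowercasing and whitespace splitting only) are assumptions made for illustration, and the metrics follow the usual SQuAD-style definitions.

```python
from collections import Counter

# Prompt template quoted in the list above.
TEMPLATE = "Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:{answer}"

def build_prompt(shots, query, separator="\n\n"):
    """Join solved demonstrations, then the query with an empty answer slot.
    The separator between examples is an assumption."""
    parts = [TEMPLATE.format(**shot) for shot in shots]
    parts.append(TEMPLATE.format(answer="", **query))
    return separator.join(parts)

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical 1-shot example.
shot = {"title": "Oslo", "text": "Oslo er hovedstaden i Norge.",
        "question": "Hva er hovedstaden i Norge?", "answer": " Oslo"}
query = {"title": "Bergen", "text": "Bergen ligger på Vestlandet.",
         "question": "Hvor ligger Bergen?"}
prompt = build_prompt([shot], query)
score = token_f1("på Vestlandet", "Vestlandet")
```

With zero demonstrations, `build_prompt([], query)` reduces to the zero-shot prompt; the model's greedy continuation after `Svar:` is then scored against the reference answer.
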
Performance results on the extractive question answering task (NorQuAD)

| Model | 0-shot (F1/EM) | 1-shot (F1/EM) | 2-shot (F1/EM) |
|---|---|---|---|
| NorMistral-7b-warm | 48.6/24.8 | 63.6/40.0 | 66.5/43.8 |
| NorMistral-7b-scratch | 34.0/15.7 | 46.5/25.8 | 48.5/27.8 |
| NorBLOOM-7b | 35.0/13.3 | 47.7/28.0 | 49.3/30.1 |
| NB-GPT-J | 24.4/6.8 | 32.8/11.6 | 35.0/12.3 |
| Falcon-7B | 15.8/7.0 | 27.3/13.9 | 27.4/13.1 |
| GPT-Sw3-6.7B | 46.5/22.0 | 55.9/32.0 | 58.1/34.3 |
| GPT-Sw3-6.7B-v2 | 46.9/22.5 | 61.1/38.9 | 66.0/44.5 |

Sentiment analysis

NoReC (Øvrelid et al., 2020) is a dataset for sentence-level sentiment analysis derived from the Norwegian Review Corpus (Velldal et al., 2018). We use the binary formulation of this task (positive vs. negative).

Method
Macro-averaged F1-scores on the sentence-level sentiment analysis task (NoReC)

| Model | 0-shot (macro F1) | 1-shot (macro F1) | 16-shot (macro F1) |
|---|---|---|---|
| NorMistral-7b-warm | 60.6 | 77.8 | 87.3 |
| NorMistral-7b-scratch | 47.3 | 62.2 | 80.1 |
| NorBLOOM-7b | 75.7 | 73.8 | 65.5 |
| NB-GPT-J | 48.4 | 56.5 | 65.2 |
| Falcon-7B | 53.3 | 61.6 | 74.9 |
| GPT-Sw3-6.7B | 61.5 | 72.2 | 76.5 |
| GPT-Sw3-6.7B-v2 | 42.4 | 69.1 | 83.4 |
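
For reference, macro-averaged F1 on a binary task is the unweighted mean of the per-class F1 scores. The sketch below shows that definition on hypothetical labels; the label strings and the function name are assumptions, and the scores in the table are this quantity multiplied by 100.

```python
def macro_f1(gold, predicted, labels=("positive", "negative")):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical gold labels and predictions.
gold = ["positive", "positive", "negative", "negative"]
pred = ["positive", "negative", "negative", "negative"]
result = macro_f1(gold, pred)
print(f"{100 * result:.1f}")
```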

Machine translation

Tatoeba (Tiedemann, 2020) is a machine translation benchmark that covers hundreds of language pairs. We consider six translation directions (English ↔ Bokmål, English ↔ Nynorsk, and Bokmål ↔ Nynorsk).

Method
English → Norwegian Bokmål

| Model | 0-shot (BLEU/chrF++) | 1-shot (BLEU/chrF++) | 5-shot (BLEU/chrF++) |
|---|---|---|---|
| NorMistral-7b-warm | 55.8/70.7 | 56.7/71.5 | 57.7/72.4 |
| NorMistral-7b-scratch | 46.4/62.9 | 50.4/66.3 | 52.1/67.6 |
| NorBLOOM-7b | 37.1/53.6 | 50.1/65.8 | 52.0/67.6 |
| NB-GPT-J | 8.6/39.1 | 35.9/64.5 | 47.2/68.7 |
| Falcon-7B | 19.1/40.1 | 20.6/41.8 | 22.1/43.6 |
| GPT-Sw3-6.7B | 21.8/55.2 | 54.5/69.6 | 58.6/73.2 |
| GPT-Sw3-6.7B-v2 | 20.6/53.2 | 51.2/66.6 | 58.4/73.0 |

English → Norwegian Nynorsk

| Model | 0-shot (BLEU/chrF++) | 1-shot (BLEU/chrF++) | 5-shot (BLEU/chrF++) |
|---|---|---|---|
| NorMistral-7b-warm | 43.6/62.0 | 44.2/63.2 | 44.3/63.7 |
| NorMistral-7b-scratch | 38.0/56.9 | 39.2/57.9 | 40.7/59.3 |
| NorBLOOM-7b | 35.6/54.7 | 36.6/56.3 | 38.1/57.4 |
| NB-GPT-J | 1.7/14.7 | 6.3/34.1 | 35.2/60.4 |
| Falcon-7B | 6.4/28.6 | 8.3/30.5 | 9.3/32.1 |
| GPT-Sw3-6.7B | 13.4/44.3 | 43.6/62.5 | 44.5/63.5 |
| GPT-Sw3-6.7B-v2 | 14.8/45.5 | 43.7/62.3 | 44.0/63.6 |

Norwegian Bokmål → English

| Model | 0-shot (BLEU/chrF++) | 1-shot (BLEU/chrF++) | 5-shot (BLEU/chrF++) |
|---|---|---|---|
| NorMistral-7b-warm | 55.1/68.4 | 55.5/69.5 | 56.0/69.8 |
| NorMistral-7b-scratch | 47.1/61.9 | 49.4/64.2 | 52.3/66.2 |
| NorBLOOM-7b | 45.0/59.3 | 48.3/64.0 | 49.0/64.7 |
| NB-GPT-J | 9.8/41.4 | 24.8/58.3 | 47.6/67.7 |
| Falcon-7B | 21.6/40.6 | 31.7/47.4 | 36.6/51.7 |
| GPT-Sw3-6.7B | 47.8/66.2 | 49.1/68.1 | 49.6/69.4 |
| GPT-Sw3-6.7B-v2 | 46.3/67.5 | 48.9/69.3 | 58.2/72.8 |

Norwegian Nynorsk → English

| Model | 0-shot (BLEU/chrF++) | 1-shot (BLEU/chrF++) | 5-shot (BLEU/chrF++) |
|---|---|---|---|
| NorMistral-7b-warm | 55.1/68.4 | 55.5/69.5 | 56.0/69.8 |
| NorMistral-7b-scratch | 47.1/61.9 | 49.4/64.2 | 52.3/66.2 |
| NorBLOOM-7b | 45.0/59.3 | 48.3/64.0 | 49.0/64.7 |
| NB-GPT-J | 2.9/19.5 | 10.1/41.0 | 44.4/66.9 |
| Falcon-7B | 21.6/40.6 | 31.7/47.4 | 36.6/57.1 |
| GPT-Sw3-6.7B | 47.8/66.2 | 49.1/68.1 | 49.6/69.4 |
| GPT-Sw3-6.7B-v2 | 46.3/67.5 | 48.9/69.3 | 58.2/72.8 |

Norwegian Bokmål → Norwegian Nynorsk

| Model | 0-shot (BLEU/chrF++) | 1-shot (BLEU/chrF++) | 5-shot (BLEU/chrF++) |
|---|---|---|---|
| NorMistral-7b-warm | 75.8/87.5 | 74.0/86.9 | 75.3/87.5 |
| NorMistral-7b-scratch | 38.0/56.9 | 39.2/57.9 | 40.7/59.3 |
| NorBLOOM-7b | 71.5/84.4 | 70.1/84.1 | 71.9/85.1 |
| NB-GPT-J | 6.6/35.5 | 9.6/41.0 | 26.0/64.7 |
| Falcon-7B | 28.7/59.2 | 29.8/60.8 | 32.1/62.3 |
| GPT-Sw3-6.7B | 63.6/82.8 | 74.7/86.0 | 75.8/86.9 |
| GPT-Sw3-6.7B-v2 | 57.5/81.1 | 75.3/86.7 | 76.7/87.6 |

Norwegian Nynorsk → Norwegian Bokmål

| Model | 0-shot (BLEU/chrF++) | 1-shot (BLEU/chrF++) | 5-shot (BLEU/chrF++) |
|---|---|---|---|
| NorMistral-7b-warm | 88.1/93.6 | 89.2/94.3 | 89.3/94.6 |
| NorMistral-7b-scratch | 85.1/91.4 | 86.6/92.4 | 87.4/93.0 |
| NorBLOOM-7b | 78.7/88.5 | 84.2/90.7 | 87.4/93.0 |
| NB-GPT-J | 2.7/18.5 | 6.9/35.6 | 52.9/84.3 |
| Falcon-7B | 36.7/61.6 | 38.3/63.5 | 45.8/68.1 |
| GPT-Sw3-6.7B | 652.3/82.4 | 86.1/92.5 | 87.8/93.6 |
| GPT-Sw3-6.7B-v2 | 72.0/88.6 | 86.1/92.5 | 88.2/93.9 |

Hardware and Software

Training Factors: The models were pretrained using the Megatron-DeepSpeed library on the LUMI cluster in Finland.

Carbon Footprint: Pretraining one model took approximately 70k GPU hours of computation on AMD MI250X GPUs (counting 2 GPUs per AMD MI250X device), each of which draws 500 W. LUMI is one of the most eco-efficient data centers in the world, and its energy consumption is covered 100% with renewable electricity.
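
The figures above allow a back-of-the-envelope energy estimate. The sketch assumes that the 500 W figure applies to one MI250X device (that is, to a pair of GPUs as counted here); if it were meant per GPU, the result would double. It also ignores cooling and other data center overhead.

```python
GPU_HOURS = 70_000       # reported, approximate
GPUS_PER_DEVICE = 2      # accounting convention stated in the text
WATTS_PER_DEVICE = 500   # assumption: 500 W is the draw of one MI250X device

device_hours = GPU_HOURS / GPUS_PER_DEVICE
energy_mwh = device_hours * WATTS_PER_DEVICE / 1e6  # Wh -> MWh
print(f"{energy_mwh:.1f} MWh per pretrained model")
```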


Example usage

Let's try to use this model for English-to-Norwegian machine translation using simple zero-shot prompting:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# First, load the tokenizer and the language model
tokenizer = AutoTokenizer.from_pretrained("norallm/norbloom-7b-scratch")
model = AutoModelForCausalLM.from_pretrained("norallm/norbloom-7b-scratch").cuda().eval()

# Now we will define the zero-shot prompt template
prompt = """Engelsk: {0}
Bokmål:"""

# A function that will take care of generating the output
@torch.no_grad()
def generate(text):
    text = prompt.format(text)
    input_ids = tokenizer(text, return_tensors='pt').input_ids.cuda()
    prediction = model.generate(
        input_ids,
        max_new_tokens=64,
        do_sample=False,
        eos_token_id=tokenizer('\n').input_ids
    )
    return tokenizer.decode(prediction[0, input_ids.size(1):]).strip()

# Now you can simply call the generate function with an English text you want to translate:
generate("I'm super excited about this Norwegian NORA model! Can it translate these sentences?")
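
The zero-shot template above can be extended to few-shot prompting by prepending solved examples. This card does not document the exact few-shot format used in the evaluation, so the following is only a sketch under the assumption that demonstrations are separated by a single newline, which matches the newline stop token used in `generate`. The helper name `few_shot_prompt` and the example pair are hypothetical.

```python
# Same template as in the zero-shot example.
PROMPT = """Engelsk: {0}
Bokmål:"""

def few_shot_prompt(examples, source, separator="\n"):
    """Build a prompt from (English, Bokmål) pairs followed by the sentence to translate."""
    demos = [f"{PROMPT.format(en)} {nb}" for en, nb in examples]
    demos.append(PROMPT.format(source))
    return separator.join(demos)

# Hypothetical demonstration pair.
text = few_shot_prompt([("Hello.", "Hei.")], "Thank you.")
print(text)
```

The resulting string can be tokenized and passed to `model.generate` in place of the formatted zero-shot prompt.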