metadata

license: other
datasets:
  - databricks/databricks-dolly-15k
  - laion/OIG
  - OpenAssistant/oasst1
language:
  - da
  - sv
  - 'no'
  - en
  - is
pipeline_tag: text-generation

Model description

GPT-SW3 is a collection of large decoder-only pretrained transformer language models that were developed by AI Sweden in collaboration with RISE and the WASP WARA for Media and Language. GPT-SW3 has been trained on a dataset containing 320B tokens in Swedish, Norwegian, Danish, Icelandic, English, and programming code. The model was pretrained using a causal language modeling (CLM) objective utilizing the NeMo Megatron GPT implementation.

The instruct models were finetrained on instruction data using both chat and raw text formats.

Intended use

GPT-SW3 is an autoregressive large language model that is capable of generating coherent text in 5 different languages, and 4 programming languages. GPT-SW3 can also be instructed to perform text tasks that it has not been explicitly trained for, by casting them as text generation tasks. AI Sweden shares GPT-SW3 in a controlled pre-release with organizations and individuals in the Nordic NLP ecosystem who can contribute to the validation and testing of the models and provide feedback to the community. This is an important step in the process of validating the model and collecting feedback on both what works well and what does not.

Limitations

Like other large language models for which the diversity (or lack thereof) of training data induces downstream impact on the quality of our model, GPT-SW3 has limitations in terms of for example bias and safety. GPT-SW3 can also have quality issues in terms of generation diversity and hallucination. By releasing with the modified RAIL license, we also hope to increase communication, transparency, and the study of large language models. The model may: overrepresent some viewpoints and underrepresent others, contain stereotypes, generate hateful, abusive, violent, discriminatory or prejudicial language. The model may make errors, including producing incorrect information as if it were factual, it may generate irrelevant or repetitive outputs, and content that may not be appropriate for all settings, including sexual content.

How to use

To be able to access the model from Python, since this is a private repository, you have to log in with your access token. This can be done with huggingface-cli login, see HuggingFace Quick Start Guide for more information.

The following code snippet loads our tokenizer & model, and uses the GPU if available.

import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# Initialize Variables
model_name = "AI-Sweden-Models/gpt-sw3-126m-instruct"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
prompt = "Träd är fina för att"

# Initialize Tokenizer & Model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
model.to(device)

Generating text using the generate method is done as follows:

input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)

generated_token_ids = model.generate(
    inputs=input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.6,
    top_p=1,
)[0]

generated_text = tokenizer.decode(generated_token_ids)

How to use for chat

The chat format used during data-preprocessing takes the form:

<|endoftext|><s>
User:
Jag tycker träd är fina
<s>
Bot:
Kul att du tycker det!
<s>
...

The procedure to generate text in chat format:

from transformers import StoppingCriteriaList, StoppingCriteria

prompt = """
<|endoftext|><s>
User:
Varför är träd fina?
<s>
Bot:
""".strip()

# (Optional) - define a stopping criteria
# We ideally want the model to stop generate once the response from the Bot is generated
class StopOnTokenCriteria(StoppingCriteria):
    def __init__(self, stop_token_id):
        self.stop_token_id = stop_token_id

    def __call__(self, input_ids, scores, **kwargs):
        return input_ids[0, -1] == self.stop_token_id

stop_on_token_criteria = StopOnTokenCriteria(stop_token_id=tokenizer.bos_token_id)
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)

generated_token_ids = model.generate(
    inputs=input_ids,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.6,
    top_p=1,
    stopping_criteria=StoppingCriteriaList([stop_on_token_criteria])
)[0]

generated_text = tokenizer.decode(generated_token_ids[len(input_ids[0]):-1])

Generating text using the generate method is done as follows:

input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)

generated_token_ids = model.generate(
    inputs=input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.6,
    top_p=1,
)[0]

A convenient alternative to the `generate` method is the HuggingFace pipeline, which handles most of the work for you:
```python
generator = pipeline('text-generation', tokenizer=tokenizer, model=model, device=device)
generated = generator(prompt, max_new_tokens=100, do_sample=True, temperature=0.6, top_p=1)[0]["generated_text"]

Compliance

The release of GPT-SW3 consists of model weights, a configuration file, a tokenizer file and a vocabulary file. None of these files contain any personally identifiable information (PII) or any copyrighted material.

GPT-SW3 Model Card

Following Mitchell et al. (2018), we provide a model card for GPT-SW3.

Model Details

Person or organization developing model: GPT-SW3 was developed by AI Sweden in collaboration with RISE and the WASP WARA for Media and Language.
Model date: GPT-SW3 date of release 2022-12-20
Model version: This is the second generation of GPT-SW3.
Model type: GPT-SW3 is a large decoder-only transformer language model.
Information about training algorithms, parameters, fairness constraints or other applied approaches, and features: GPT-SW3 was trained with the NeMo Megatron GPT implementation.
Paper or other resource for more information: N/A.
License: LICENSE.
Where to send questions or comments about the model: nlu@ai.se

Intended Use

Primary intended uses: We pre-release GPT-SW3 for research and evaluation of the capabilities of Large Language Models for the Nordic languages. This is an important step in the process of knowledge building for LLMs, validating the model and collecting feedback on both what works well and what does not.
Primary intended users: Organizations and individuals in the Nordic NLP ecosystem who can contribute to the validation and testing of the models and provide feedback to the community.
Out-of-scope use cases: See the modified RAIL license.

Data, Limitations, and Recommendations

Data selection for training: Training data for GPT-SW3 was selected based on a combination of breadth and availability. See our Datasheet for more detailed information on the data used to train our model.
Data selection for evaluation: N/A
Limitations: Like other large language models for which the diversity (or lack thereof) of training data induces downstream impact on the quality of our model, GPT-SW3 has limitations in terms of bias and safety. GPT-SW3 can also have quality issues in terms of generation diversity and hallucination. In general, GPT-SW3 is not immune from the plethora of issues that plague modern large language models. By releasing with the modified RAIL license, we also hope to increase communication, transparency, and the study of large language models. The model may: Overrepresent some viewpoints and underrepresent others. Contain stereotypes. Generate: Hateful, abusive, or violent language. Discriminatory or prejudicial language. Content that may not be appropriate for all settings, including sexual content. Make errors, including producing incorrect information as if it were factual. Generate irrelevant or repetitive outputs.
Recommendations for future work: Indirect users should be made aware when the content they're working with is created by the LLM. Users should be aware of Risks and Limitations, and include an appropriate age disclaimer or blocking interface as necessary. Models pretrained with the LLM should include an updated Model Card. Users of the model should provide mechanisms for those affected to provide feedback, such as an email address for comments.
We hope that the release of GPT-SW3, as well as information around our model training process, will increase open science around both large language models in specific and natural language processing and deep learning in general.

GPT-SW3 Datasheet

We follow the recommendations of Gebru et al. (2021) and provide a datasheet for the dataset used to train GPT-SW3.

Motivation

For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description. Pre-training of Large Language Models (LLM), such as GPT-3 (T. B. Brown et al., 2020), Gopher (J. W. Rae et al., 2022), BLOOM (T. L. Scao et al., 2022), etc. require 100s or even 1000s GBs of text data, with recent studies (Chinchilla: J. Hoffmann et al., 2022) suggesting that the scale of the training data is even more important than previously imagined. Therefore, in order to train Swedish LLMs, we needed a large scale Swedish dataset of high quality. Since no such datasets existed before this initiative, we collected data in the Nordic and English languages.
Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)? The Strategic Initiative Natural Language Understanding at AI Sweden has established a new research environment in which collaboration is key. The core team working on the creation of the dataset is the NLU research group at AI Sweden. This group consists of researchers and developers from AI Sweden (Lindholmen Science Park AB) and RISE.
Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number. The Swedish Innovation Agency (Vinnova) has funded this work across several different grants, including 2019-02996 and 2022-00949.
Any other comments? No.

Composition

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description. The instances are textual documents categorized by language and document type. The dataset is a filtered and deduplicated collection that includes the following sources:
Books
- Litteraturbanken (https://litteraturbanken.se/)
- The Pile
Articles
- Diva (https://www.diva-portal.org/)
- The Pile: PubMed
- The Pile: ArXiv
Code
- Code Parrot: Github code (https://huggingface.co/datasets/codeparrot/github-code)
Conversational
- Familjeliv (https://www.familjeliv.se/)
- Flashback (https://flashback.org/)
- Datasets collected through Parlai (see Appendix in data paper for complete list) (https://github.com/facebookresearch/ParlAI)
- Pushshift.io Reddit dataset, developed in Baumgartner et al. (2020) and processed in Roller et al. (2021)
Math
- English Math dataset generated with code from DeepMind (D. Saxton et al., 2019)
- Swedish Math dataset, generated as above with manually translated templates
Miscellaneous
- Summarization data (https://www.ida.liu.se/~arnjo82/papers/clarin-21-julius.pdf)
- OPUS, the open parallel corpus (https://opus.nlpl.eu/)
- Movie scripts (https://github.com/Aveek-Saha/Movie-Script-Database)
- Natural Instructions (https://github.com/allenai/natural-instructions)
- P3 (Public Pool of Prompts), (https://huggingface.co/datasets/bigscience/P3)
- The Norwegian Colossal Corpus (https://huggingface.co/datasets/NbAiLab/NCC)
- Danish Gigaword (https://gigaword.dk/)
- Icelandic Gigaword (https://clarin.is/en/resources/gigaword/)
- The Pile: Stack Exchange
Web Common Crawl
- Web data from the project LES (Linguistic Explorations of Societies, https://les.gu.se).
- Multilingual C4 (MC4), prepared by AllenAI from C4 (C. Raffel et al., 2019)
- Open Super-large Crawled Aggregated coRpus (OSCAR) (P. O. Suarez, 2019)
- The Pile: Open Web Text
Web Sources
- Various public Swedish website scrapes (see Appendix in data paper)
- Familjeliv Articles
- Public Swedish Job Ads from JobTech/Arbetsförmedlingen
- Wikipedia
- Official Wikipedia dumps
Instruction data:
- dolly
- Open Assistant
- OIG
- Fass: Swedish pharmaceutical information, which was transformed into Q&A format.
How many instances are there in total (of each type, if appropriate)? The training data consists of 1.1TB UTF-8 encoded text, containing 660M documents with a total of 320B tokens.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable). The subset of our dataset that comes from multilingual Common Crawl datasets (MC4, Oscar), are filtered by language to only include Swedish, Norwegian, Danish, and Icelandic. From The Pile, we included only the parts that typically are of highest textual quality or complemented the rest of our dataset with sources we otherwise lacked (e.g. books). The remainder of the dataset was collected from the above sources.
What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description. Each instance consists of raw text data.
Is there a label or target associated with each instance? If so, please provide a description. No.
Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text. No.
Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? If so, please describe how these relationships are made explicit. There are no explicit relationships between individual instances.
Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them. There are no explicit splits recommended for this dataset. When pre-training the model, a random split for train, dev, test is set to 99.99%, 0.08%, 0.02% respectively, and is sampled proportionally to each subset’s weight and size. The weight of each subset was manually decided beforehand. These decisions were made considering the data’s value, source, and language, to form a representative and balanced pre-training corpus.
Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description. The dataset is a collection of many sources, some of which naturally contain some overlap. Although we have performed deduplication, some overlap may still remain. Furthermore, there may be some noise remaining from artifacts originating in Common Crawl datasets, that have been missed by our data filtering process. Except for these, we are not aware of any errors, sources of noise, or redundancies.
Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? The dataset is self-contained.
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why. The dataset contains subsets of public Common Crawl, Reddit, Familjeliv and Flashback. These could contain sentences that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety.
Does the dataset relate to people? If not, you may skip the remaining questions in this section. Some documents of this data relate to people, such as news articles, Wikipedia descriptions, etc.
Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset. No, the dataset does not explicitly include subpopulation identification.
Any other comments? No.

Collection Process

How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how. N/A. The dataset is a union of publicly available datasets and sources.
What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated? The data was downloaded from the internet.
If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? Please see previous answers for how parts of the dataset were selected.
Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)? This data is mined, filtered and sampled by machines.
Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created. The dataset was collected during the period June 2021 to June 2022. The creation of the collected sources varies, with e.g. Common Crawl data that have been continuously collected over 12 years.
Does the dataset relate to people? If not, you may skip the remainder of the questions in this section. Yes. The texts have been produced by people. Any personal information potentially present in publicly available data sources and thus in the created dataset is of no interest to the collection and use of the dataset.
Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation. Yes.
Any other comments? No.
Preprocessing/cleaning/labeling
Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section. The dataset was filtered and re-formatted on a document-level using standard procedures, inspired by the work in The BigScience ROOTS Corpus (H. Laurençon et al., 2022) and Gopher (J. W. Rae et al., 2022). This was done with the goal of achieving a consistent text format throughout the dataset, and to remove documents that did not meet our textual quality requirements (e.g. repetitiveness). Furthermore, the dataset was deduplicated to remedy the overlap between collected subsets using the MinHash algorithm, similar to the method used in GPT-3 and The Pile, and described in greater detail in “Deduplicating Training Data Makes Language Models Better” (K. Lee et al., 2021).

Instruction data: The processing outlined above was not applied to the instruction data. Instruction data was turned into chat-turn format and formatted accordingly with an end-of-turn token, as well as unrolled into raw textual form. The Open Assistant data was also automatically translated using GPT-SW3 into Swedish, Danish, Norwegian, and Icelandic.
Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data. The “raw” component datasets are publicly available in their respective locations.
Any other comments? No.

Uses

Has the dataset been used for any tasks already? If so, please provide a description. The dataset was used to pre-train the GPT-SW3 models.
Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point. N/A.
What (other) tasks could the dataset be used for? The data can be used to pre-train language models, which are foundations for many current and future language tasks.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks) If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms? The dataset is probably quite representative of Swedish internet discourse in general, and of the Swedish public sector, but we know that this data does not necessarily reflect the entire Swedish population.
Are there tasks for which the dataset should not be used? If so, please provide a description. None that we are currently aware of.
Any other comments? No.

Distribution

Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description. No.
How will the dataset distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)? N/A.
When will the dataset be distributed? N/A.
Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions. N/A.
Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation. N/A.
Any other comments? No.

Maintenance

Who is supporting/hosting/maintaining the dataset? AI Sweden at Lindholmen Science Park AB.
How can the owner/curator/manager of the dataset be contacted (e.g., email address)? nlu@ai.se
Is there an erratum? If so, please provide a link or other access point. N/A.
Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to users (e.g., mailing list, GitHub)? Currently, there are no plans for updating the dataset.
If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced. Read the privacy policy for the NLU initiative at AI Sweden here.
Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to users. N/A.
If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/ verified? If so, please describe how. If not, why not? Is there a process for communicating/ distributing these contributions to other users? If so, please provide a description. Not at this time.
Any other comments? No.