---
license: bigscience-openrail-m
datasets:
- mc4
language:
- sv
library_name: transformers
inference:
  parameters:
    top_p: 0.9
    repetition_penalty: 1.1
    max_new_tokens: 75
    do_sample: true
widget:
  - text: ":nyheter:"
    example_title: "News text"
  - text: ":wiki:"
    example_title: "Wikipedia text"
  - text: ":blogg:"
    example_title: "Blog post"
  - text: ":forum:"
    example_title: "Forum"
  - text: ":anons:"
    example_title: "Ads"
---
# SweCTRL-Mini
<!-- Provide a quick summary of what the model is/does. -->
SweCTRL-Mini is a large Swedish language model that can be used for inference and fine-tuning on a single consumer-grade GPU. The model is based on the CTRL architecture by Keskar, McCann, Varshney, Xiong, and Socher
(2019), which means that users of the SweCTRL-Mini model can control the genre of the generated text by inserting special tokens in the generation prompts.
Crucially, note that this model is:
- **NOT** trained to follow GPT-like instructions,
- **NOT** trained for conversations, like ChatGPT,
- **NOT** trained on any multi-modal data; the only modality is text, more than 99% of it in Swedish.
**Note on using the Inference API (text box to the right):** There are a number of presets that start the text with the appropriate control codes to set the genre, e.g., `:wiki:` for
texts from Wikipedia. You can add your own prompt on top of these control codes. For instance, if you want a Wikipedia article about Stockholm, you could write
`:wiki: Stockholm`. The generation in the example is limited to a maximum of 75 new tokens. Normally, generation should stop after reaching the ending control code,
which has a `$` symbol at the end, e.g., `:wiki:$` for Wikipedia texts. However, I could not configure that here, so please ignore any text after such tokens if they are
generated (a small sketch of such truncation is shown below). Additionally, note that there are **no** filters or other mechanisms for making the text safe from biases or preventing it from generating texts on any topic.
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
- **Developed by:** Dmytro Kalpakchi (with supervision from Johan Boye)
- **Shared by:** Dmytro Kalpakchi
- **Model type:** Transformer-based language model trained by predicting the next token
- **Language(s) (NLP):** Swedish
- **License:** BigScience Open RAIL-M
- **Finetuned from model:** None, trained from scratch
### Model Sources
<!-- Provide the basic links for the model. -->
- **Website:** https://swectrl.dev/
- **Repository:** https://github.com/dkalpakchi/SweCTRL-Mini
- **Paper:** https://arxiv.org/pdf/2304.13994.pdf
- **Technical note:** https://zenodo.org/record/7868205
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
The model should be used for generating texts of various genres in Swedish.
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
Please refer to Appendix A of the License file for information on use restrictions. The model has a limited context window of 256 tokens, so it will most probably not work well
for text summarization. Additionally, the vast majority of its training data was in Swedish, although it also contains tokens in other languages, so tasks like
machine translation would require further fine-tuning. A quick sketch of checking a prompt against the 256-token window is shown below.
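The sketch assumes the standard Hugging Face `AutoTokenizer` works for this repository; it simply counts how many of the 256 context tokens a prompt would consume:
```py
from transformers import AutoTokenizer

# Sketch: check how much of the 256-token context window a prompt would use.
tokenizer = AutoTokenizer.from_pretrained("dkalpakchi/SweCTRL-Mini")
prompt = ":nyheter: Idag presenterade regeringen"
n_tokens = len(tokenizer(prompt)["input_ids"])
print(f"{n_tokens} of 256 context tokens used by the prompt")
```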
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
To mitigate the inclusion of personally identifiable data, we attempted to remove sources that could contain such data to the best of our ability (see the Technical note for
more details on the data filtering process). However, we have still noted that the model can generate text that includes various forms of biases, which is why we strongly
recommend human curation of the generated texts. Currently, we have conducted no systematic investigation of either the kinds of biases included in the generated texts or how
frequently they occur. Contributions from the community on this matter would be very welcome.
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
For further recommendations on the use of the model, please see the associated paper.
## How to Get Started with the Model
The fastest way to start with the model is using the code below:
```py
from transformers import pipeline

# Load the text-generation pipeline for SweCTRL-Mini and prompt it with the
# ":nyheter:" control code to generate news-style text in Swedish.
pipe = pipeline(model="dkalpakchi/SweCTRL-Mini")
print(pipe(":nyheter:", max_length=256, repetition_penalty=1.1, top_p=0.9))
```
For more advanced uses and other code examples, please see the associated GitHub repository (https://github.com/dkalpakchi/SweCTRL-Mini).
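For a slightly lower-level view, the sketch below loads the tokenizer and model explicitly; it assumes the standard `AutoTokenizer`/`AutoModelForCausalLM` classes work for this checkpoint and mirrors the sampling parameters of the pipeline example, so treat it as an illustration rather than the canonical usage:
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Sketch: explicit generation from a control-code prompt. Class choices and
# parameter values are assumptions mirroring the pipeline example above.
tokenizer = AutoTokenizer.from_pretrained("dkalpakchi/SweCTRL-Mini")
model = AutoModelForCausalLM.from_pretrained("dkalpakchi/SweCTRL-Mini")

inputs = tokenizer(":wiki: Stockholm", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=75,
        do_sample=True,
        top_p=0.9,
        repetition_penalty=1.1,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```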
## Training Details
### Training Data
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
The training data includes the *subset* of cleaned Swedish mC4, as well as some documents from Project Runeberg.
Extensive information on the training data is provided in Section 1 of the Technical note.
An interface for partially mining the training data is available at: https://swectrl.dev/data
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
#### Preprocessing
See Section 1 of the Technical note.
#### Training Hyperparameters
- **Training regime:** fp32
## Evaluation
See Sections 5.3, 6, and 7 in the associated paper, and Section 3 of the Technical note.
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** 8 A100 GPUs
- **Hours used:** 11907.6 GPU-hours for training and experimentation
- **Provider:** BerzeLiUs supercomputer
- **Carbon Emitted:** No public data on carbon efficiency, so hard to estimate
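For a rough sense of how such an estimate would be computed, the back-of-the-envelope sketch below multiplies GPU-hours by power draw and grid carbon intensity; the power, PUE, and carbon-intensity constants are placeholder assumptions, not reported figures:
```py
# Illustrative back-of-the-envelope estimate, NOT a reported figure.
# All constants below are assumptions made for the sake of the example.
gpu_hours = 11907.6          # from the model card (training + experimentation)
gpu_power_kw = 0.4           # assumed average A100 draw (400 W)
pue = 1.2                    # assumed data-centre power usage effectiveness
carbon_intensity = 0.03      # assumed kg CO2eq per kWh

energy_kwh = gpu_hours * gpu_power_kw * pue
co2_kg = energy_kwh * carbon_intensity
print(f"~{energy_kwh:.0f} kWh, ~{co2_kg:.0f} kg CO2eq (illustrative only)")
```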
## Technical Specifications
See Section 3 of the associated paper.
## Citation
**BibTeX:**
```bibtex
@article{kalpakchi2023swectrl,
title={SweCTRL-Mini: a data-transparent Transformer-based large language model for controllable text generation in Swedish},
author={Kalpakchi, Dmytro and Boye, Johan},
journal={arXiv preprint arXiv:2304.13994},
year={2023}
}
```
**APA:**
Kalpakchi, D., & Boye, J. (2023). SweCTRL-Mini: a data-transparent Transformer-based large language model for controllable text generation in Swedish. arXiv preprint arXiv:2304.13994.
## Model Card Authors
Dmytro Kalpakchi (dmytroka@kth.se)
## Model Card Contact
Dmytro Kalpakchi (dmytroka@kth.se)
# References
Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., & Socher, R. (2019). CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.