Icebreaker tokenizer

This is a byte-pair encoding (BPE) tokenizer trained on the Icelandic Gigaword Corpus, News 1. The tokenizer can be used for training Icelandic language models.

Model Details

BPE tokenizer, trained on the first 242553 files of the News 1 portion of the IGC 2022 unannotated dataset published by Árnastofnun.

Model Description

It has a vocab size of 3200.

  • Developed by: Sigurdur Haukur Birgisson
  • Model type: GPT2Tokenizer
  • Language(s) (NLP): Icelandic
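
To make the vocabulary size concrete: a BPE vocabulary is a set of base symbols plus one new symbol per learned merge, so the size limits how many merges are kept. The sketch below is a toy, character-level illustration of the merge-learning loop. It is not the training script used for this tokenizer; the function name `learn_bpe_merges`, the sample words, and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy sketch)."""
    # Represent each word as a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is a single symbol, nothing left to merge
        # Pick the most frequent pair; ties go to the lexicographically
        # largest pair so the result is deterministic (an arbitrary choice).
        best = max(pairs, key=lambda pair: (pairs[pair], pair))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


# Hypothetical sample: three inflected forms of the Icelandic word "hestur".
print(learn_bpe_merges(["hestur", "hestar", "hestum"], 3))
```

A GPT2Tokenizer works on bytes rather than characters, so its base alphabet is the 256 byte values, but the merge loop follows the same idea.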

How to Get Started with the Model

Use the code below to load the tokenizer and encode a sentence.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Sigurdur/icebreaker")
tokens = tokenizer("Halló heimur!")
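
Calling the tokenizer on a string returns an encoding whose `input_ids` field holds the token ids. A small helper like the one below can show the ids, the token strings, and the decoded text side by side. This is a minimal sketch: the helper name `describe_encoding` is an assumption made for illustration, and it relies only on the standard `convert_ids_to_tokens` and `decode` tokenizer methods from `transformers`.

```python
def describe_encoding(tokenizer, text):
    """Return the ids, token strings, and round-tripped text for `text`."""
    encoding = tokenizer(text)
    ids = encoding["input_ids"]
    return {
        "ids": ids,
        # Vocabulary entries that the ids map to.
        "tokens": tokenizer.convert_ids_to_tokens(ids),
        # Decoding the ids should give back readable text.
        "decoded": tokenizer.decode(ids),
    }


# Usage (downloads the tokenizer from the Hugging Face Hub):
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("Sigurdur/icebreaker")
#   print(describe_encoding(tokenizer, "Halló heimur!"))
```

Because the model type is GPT2Tokenizer, which is byte-level, the token strings will likely show a leading `Ġ` where a token begins with a space, and non-ASCII letters such as `ó` may appear as multi-character byte symbols. The decoded text is the place to check that the round trip preserves the Icelandic characters.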

Model Card Contact

Sigurdur Haukur Birgisson: haukurbirgisson5@gmail.com
