---
license: cc-by-nc-sa-4.0
language:
- ga
- sga
- mga
- ghc
- la
library_name: transformers
---

The **Historical Irish SentencePiece tokenizer** was trained on Old, Middle, Early Modern, Classical Modern and pre-reform Modern Irish texts from the St. Gall Glosses, the Würzburg Glosses, CELT and the book subcorpus of the Historical Irish Corpus. The training data spans ca. 550 to 1926 and covers a wide variety of genres, such as bardic poetry, native Irish stories, translations and adaptations of continental epic and romance, annals, genealogies, grammatical and medical tracts, diaries, and religious writing. Because some texts contain code-switching, the vocabulary also includes some Latin.

[SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018)](https://arxiv.org/pdf/1808.06226.pdf) treats the input as a raw stream of characters, so the space is included in the set of characters to use. It then applies the BPE or unigram algorithm to construct the vocabulary. This helps with languages that do not separate words with spaces. All transformer models in the `transformers` library that use SentencePiece use it in combination with unigram. Examples of models using SentencePiece are [ALBERT](https://huggingface.co/docs/transformers/en/model_doc/albert), [XLNet](https://huggingface.co/docs/transformers/en/model_doc/xlnet), [Marian](https://huggingface.co/docs/transformers/en/model_doc/marian), and [T5](https://huggingface.co/docs/transformers/en/model_doc/t5).
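
The idea of keeping the space inside the character set can be illustrated with a toy sketch. It is not the SentencePiece implementation: the helper names `to_raw_stream` and `from_pieces` are hypothetical, and a character-level split stands in for the subword pieces that BPE or unigram would learn. It only shows why detokenization becomes a plain concatenation once spaces are encoded as the `▁` meta symbol.

```python
from typing import List

# Toy illustration only: real SentencePiece learns subword pieces with BPE or unigram.
WS = "\u2581"  # "▁", the meta symbol SentencePiece uses to represent a space


def to_raw_stream(text: str) -> str:
    """Hypothetical helper: make every space explicit and add a leading marker."""
    return WS + text.replace(" ", WS)


def from_pieces(pieces: List[str]) -> str:
    """Hypothetical helper: concatenate pieces, then turn markers back into spaces."""
    joined = "".join(pieces).replace(WS, " ")
    # Drop the single leading space that came from the leading marker.
    return joined[1:] if joined.startswith(" ") else joined


text = "Boí Óengus in n-aidchi"
stream = to_raw_stream(text)
pieces = list(stream)  # assumption: character-level split in place of learned subwords
restored = from_pieces(pieces)

print(stream)
print(restored == text)
```

Because the marker travels with the pieces, no language-specific rules about where to re-insert spaces are needed when decoding.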

This tokenizer was trained with `vocab_size=25000` and `min_frequency=2`.

### Use

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ancatmara/historical-irish-tokenizer-sentencepiece")
texts = ['Boí Óengus in n-aidchi n-aili inna chotlud.', 'Co n-accae ní, in n-ingin cucci for crunn síuil dó.']

tokenizer(texts, max_length=128, truncation=True)
```

Out:

```python
>>> {'input_ids': [[0, 16082, 2910, 213, 8040, 13888, 1937, 6875, 343, 3455, 2], [0, 1785, 6693, 1783, 13014, 213, 14883, 739, 12985, 279, 458, 1049, 602, 358, 1782, 2]],
    'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
    'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
```

```python
tokenizer.decode([0, 16082, 2910, 213, 8040, 13888, 1937, 6875, 343, 3455, 2])
```

Out:

```python
>>> '<s> Boí Óengus in n-aidchi n-aili inna chotlud.</s>'
```