|
Metadata-Version: 2.1 |
|
Name: tokenizers |
|
Version: 0.15.2 |
|
Classifier: Development Status :: 5 - Production/Stable |
|
Classifier: Intended Audience :: Developers |
|
Classifier: Intended Audience :: Education |
|
Classifier: Intended Audience :: Science/Research |
|
Classifier: License :: OSI Approved :: Apache Software License |
|
Classifier: Operating System :: OS Independent |
|
Classifier: Programming Language :: Python :: 3 |
|
Classifier: Programming Language :: Python :: 3.7 |
|
Classifier: Programming Language :: Python :: 3.8 |
|
Classifier: Programming Language :: Python :: 3.9 |
|
Classifier: Programming Language :: Python :: 3.10 |
|
Classifier: Programming Language :: Python :: 3.11 |
|
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence |
|
Requires-Dist: huggingface_hub >=0.16.4, <1.0 |
|
Requires-Dist: pytest ; extra == 'testing' |
|
Requires-Dist: requests ; extra == 'testing' |
|
Requires-Dist: numpy ; extra == 'testing' |
|
Requires-Dist: datasets ; extra == 'testing' |
|
Requires-Dist: black ==22.3 ; extra == 'testing' |
|
Requires-Dist: sphinx ; extra == 'docs' |
|
Requires-Dist: sphinx_rtd_theme ; extra == 'docs' |
|
Requires-Dist: setuptools_rust ; extra == 'docs' |
|
Requires-Dist: tokenizers[testing] ; extra == 'dev' |
|
Provides-Extra: testing |
|
Provides-Extra: docs |
|
Provides-Extra: dev |
|
Keywords: NLP,tokenizer,BPE,transformer,deep learning |
|
Author: Anthony MOI <m.anthony.moi@gmail.com> |
|
Author-email: Nicolas Patry <patry.nicolas@protonmail.com>, Anthony Moi <anthony@huggingface.co> |
|
Requires-Python: >=3.7 |
|
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM |
|
Project-URL: Homepage, https://github.com/huggingface/tokenizers |
|
Project-URL: Source, https://github.com/huggingface/tokenizers |
|
|
|
<p align="center"> |
|
<br> |
|
<img src="https://huggingface.co/landing/assets/tokenizers/tokenizers-logo.png" width="600"/> |
|
<br> |
|
    </p>
|
<p align="center"> |
|
<a href="https://badge.fury.io/py/tokenizers"> |
|
        <img alt="PyPI version" src="https://badge.fury.io/py/tokenizers.svg">
|
</a> |
|
<a href="https://github.com/huggingface/tokenizers/blob/master/LICENSE"> |
|
<img alt="GitHub" src="https://img.shields.io/github/license/huggingface/tokenizers.svg?color=blue"> |
|
</a> |
|
</p> |
|
<br> |
|
|
|
|
|
|
|
Provides an implementation of today's most used tokenizers, with a focus on performance and
versatility.
|
|
|
Bindings over the [Rust](https://github.com/huggingface/tokenizers/tree/master/tokenizers) implementation.
If you are interested in the high-level design, you can go check it out there.
|
|
|
Otherwise, let's dive in! |
|
|
|
|
|
|
|
## Main features:

- Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3
  most common BPE versions).
|
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
  less than 20 seconds to tokenize a GB of text on a server's CPU.
|
- Easy to use, but also extremely versatile. |
|
- Designed for research and production. |
|
- Normalization comes with alignment tracking: it's always possible to get the part of the
  original sentence that corresponds to a given token (see the sketch after this list).
|
- Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
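
As a small illustration of the alignment tracking mentioned above, here is a minimal sketch using `Tokenizer.from_pretrained` (demonstrated further below in this README); the sample sentence is arbitrary:

```python
from tokenizers import Tokenizer

# Load a pretrained tokenizer from the Hub (same call as in the example further below)
tokenizer = Tokenizer.from_pretrained("bert-base-cased")

sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)

# Each token carries (start, end) character offsets into the original sentence;
# special tokens such as [CLS] and [SEP] come with (0, 0) offsets
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(token, "->", sentence[start:end])
```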
|
|
|
|
|
|
|
|
|
|
|
## Installation

### With pip:

```bash
|
pip install tokenizers |
|
``` |
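
To quickly check the installation, a minimal sketch from a Python shell:

```python
import tokenizers

# Prints the installed version of the bindings
print(tokenizers.__version__)
```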
|
|
|
|
|
|
|
### From sources:

To use this method, you need to have Rust installed:
|
|
|
```bash |
|
|
|
curl https://sh.rustup.rs -sSf | sh -s -- -y |
|
export PATH="$HOME/.cargo/bin:$PATH" |
|
``` |
|
|
|
Once Rust is installed, you can compile the bindings by doing the following:
|
|
|
```bash |
|
git clone https://github.com/huggingface/tokenizers |
|
cd tokenizers/bindings/python |
|
|
|
|
|
# Create a virtual env (you can use yours as well)
python -m venv .env
|
source .env/bin/activate |
|
|
|
|
|
# Install `tokenizers` in the current virtual env
pip install -e .
|
``` |
|
|
|
|
|
|
|
## Load a pretrained tokenizer from the Hub

```python
|
from tokenizers import Tokenizer |
|
|
|
tokenizer = Tokenizer.from_pretrained("bert-base-cased") |
|
``` |
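
Once loaded, the tokenizer can be used right away, on single sentences or on sentence pairs. A minimal sketch (the sample sentences are arbitrary):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

# Encode a single sentence
output = tokenizer.encode("I can feel the magic, can you?")
print(output.tokens)
print(output.ids)

# Encode a sentence pair; type_ids distinguish the two segments
pair = tokenizer.encode("I can feel the magic", "Can you?")
print(pair.tokens)
print(pair.type_ids)
```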
|
|
|
|
|
|
|
## Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of
these using some `vocab.json` and `merges.txt` files:
|
|
|
```python |
|
from tokenizers import CharBPETokenizer |
|
|
|
|
|
# Initialize a tokenizer from the vocabulary and merges files
vocab = "./path/to/vocab.json"
|
merges = "./path/to/merges.txt" |
|
tokenizer = CharBPETokenizer(vocab, merges) |
|
|
|
|
|
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
|
print(encoded.ids) |
|
print(encoded.tokens) |
|
``` |
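
Continuing from the snippet above, an encoding can also be turned back into a string; a minimal sketch (the exact output depends on your `vocab.json` and `merges.txt`):

```python
# Turn the ids produced above back into text
decoded = tokenizer.decode(encoded.ids)
print(decoded)
```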
|
|
|
And you can train them just as simply: |
|
|
|
```python |
|
from tokenizers import CharBPETokenizer |
|
|
|
|
|
# Initialize a tokenizer
tokenizer = CharBPETokenizer()
|
|
|
|
|
# Then train it on a list of text files
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])
|
|
|
|
|
# Now it can be used to encode
encoded = tokenizer.encode("I can feel the magic, can you?")
|
|
|
|
|
# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
|
``` |
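
The saved JSON file contains the full tokenizer definition, so it can be reloaded later with the generic `Tokenizer` class; a minimal sketch, mirroring the loading example shown further below:

```python
from tokenizers import Tokenizer

# Reload the tokenizer trained and saved above
tokenizer = Tokenizer.from_file("./path/to/directory/my-bpe.tokenizer.json")
```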
|
|
|
|
|
|
|
### Provided Tokenizers

- `CharBPETokenizer`: The original BPE
|
- `ByteLevelBPETokenizer`: The byte level version of the BPE |
|
- `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece |
|
- `BertWordPieceTokenizer`: The famous Bert tokenizer, using WordPiece |
|
|
|
All of these can be used and trained as explained above! |
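
For instance, here is a minimal sketch of loading a `BertWordPieceTokenizer` from a WordPiece vocabulary file (the `bert-vocab.txt` path is hypothetical):

```python
from tokenizers import BertWordPieceTokenizer

# Hypothetical vocab path: BertWordPieceTokenizer expects a WordPiece vocab.txt file
tokenizer = BertWordPieceTokenizer("./path/to/bert-vocab.txt", lowercase=True)

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)
```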
|
|
|
|
|
|
|
## Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer
by putting all the different parts you need together.
|
You can check how we implemented the [provided tokenizers](https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations) and adapt them easily to your own needs. |
|
|
|
|
|
|
|
### Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces
together, and then saving it to a single file:
|
|
|
```python |
|
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors |
|
|
|
|
|
# Initialize a tokenizer with an (untrained) BPE model
tokenizer = Tokenizer(models.BPE())
|
|
|
|
|
# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
|
tokenizer.decoder = decoders.ByteLevel() |
|
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True) |
|
|
|
|
|
# And then train with a BPE trainer
trainer = trainers.BpeTrainer(
|
vocab_size=20000, |
|
min_frequency=2, |
|
initial_alphabet=pre_tokenizers.ByteLevel.alphabet() |
|
) |
|
tokenizer.train([ |
|
"./path/to/dataset/1.txt", |
|
"./path/to/dataset/2.txt", |
|
"./path/to/dataset/3.txt" |
|
], trainer=trainer) |
|
|
|
|
|
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True) |
|
``` |
|
|
|
Now, when you want to use this tokenizer, it's as simple as:
|
|
|
```python |
|
from tokenizers import Tokenizer |
|
|
|
tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json") |
|
|
|
encoded = tokenizer.encode("I can feel the magic, can you?") |
|
``` |
|
|
|
|