---
library_name: transformers
license: apache-2.0
---
|
|
|
# claude3 tokenizer
|
|
|
|
|
A tokenizer intended for autoregressive/causal language modeling.
|
|
|
|
|
```python
from transformers import AutoTokenizer

tk = AutoTokenizer.from_pretrained("BEE-spoke-data/claude-tokenizer")
tk  # in a REPL this displays the repr shown below
```
|
|
|
```
GPT2TokenizerFast(name_or_path='BEE-spoke-data/claude-tokenizer', vocab_size=65000, model_max_length=200000, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<EOT>', 'eos_token': '<EOT>', 'unk_token': '<EOT>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<EOT>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<META>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<META_START>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<META_END>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("<SOS>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [4]: tk.eos_token_id
Out[4]: 0

In [5]: tk.pad_token_id

In [6]: tk.unk_token_id
Out[6]: 0

In [7]: tk.bos_token_id
Out[7]: 0
```
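
A minimal usage sketch (not from the original card): it encodes and decodes a short string, then reuses the `<EOT>` token as a pad token for batched padding, since `pad_token_id` is unset in the output above. The example strings are arbitrary placeholders.

```python
from transformers import AutoTokenizer

tk = AutoTokenizer.from_pretrained("BEE-spoke-data/claude-tokenizer")

# encode/decode round trip
ids = tk("Hello, world!").input_ids
print(ids)
print(tk.decode(ids))

# pad_token_id is None by default; a common workaround for batched
# padding is to reuse the EOS token (<EOT>, id 0) as the pad token
tk.pad_token = tk.eos_token
batch = tk(["short text", "a slightly longer piece of text"], padding=True)
print(batch["input_ids"])
```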
|