NLTK Tokenizer for Transformers πŸ€—

πŸ“– Overview

The NLTK Tokenizer is a custom tokenizer class designed for use with the Hugging Face Transformers library. The NlktTokenizer class extends PreTrainedTokenizer from Transformers to create an NLTK-based tokenizer, combining the easy integration and pretrained-tokenizer features of PreTrainedTokenizer with the linguistic processing strengths of NLTK's word_tokenize. The result is a tokenizer that handles diverse language patterns well while remaining compatible with standard NLP modeling workflows.
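
At its core the pattern is small: load a vocabulary, let NLTK do the splitting, and let PreTrainedTokenizer handle everything else. The class below is only a minimal sketch of that pattern; it is not this repository's actual implementation (the real class is NlktTokenizer in tokenization_nltk.py, and the class and attribute names used here are illustrative assumptions).

import collections
from nltk.tokenize import word_tokenize
from transformers import PreTrainedTokenizer

class SketchNltkTokenizer(PreTrainedTokenizer):
    # Illustrative sketch only; the real class is NlktTokenizer in tokenization_nltk.py
    def __init__(self, vocab_file, unk_token="[UNK]", **kwargs):
        # Load a vocab.txt that lists one token per line, in order
        self.vocab = collections.OrderedDict()
        with open(vocab_file, encoding="utf-8") as reader:
            for index, line in enumerate(reader):
                self.vocab[line.rstrip("\n")] = index
        self.ids_to_tokens = {index: token for token, index in self.vocab.items()}
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self):
        return len(self.vocab)

    def get_vocab(self):
        return dict(self.vocab)

    def _tokenize(self, text):
        # Delegate the actual splitting to NLTK's word_tokenize
        return word_tokenize(text)

    def _convert_token_to_id(self, token):
        return self.vocab.get(token, self.vocab.get(self.unk_token))

    def _convert_id_to_token(self, index):
        return self.ids_to_tokens.get(index, self.unk_token)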

πŸ› οΈ Installation

To use the NLTK Tokenizer, make sure you have both the transformers and nltk libraries installed. You can install them with:

  • With pip

pip install transformers nltk
  • With Conda

conda install -c huggingface transformers nltk
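
NLTK's word_tokenize also relies on the Punkt tokenizer models, which are not bundled with the pip or conda packages. If you have never downloaded them before, a one-time download is enough (newer NLTK releases may additionally ask for the punkt_tab resource):

import nltk

# One-time download of the Punkt models used by word_tokenize
nltk.download("punkt")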

πŸš΄β€β™‚οΈ Getting Started

Initializing the Tokenizer

  • Clone this repo
  • Go to the directory where you cloned this repo
  • Initialize the NLTK Tokenizer with a vocabulary file. Note that your vocab file should list one token per line:
from tokenization_nltk import NlktTokenizer

tokenizer = NlktTokenizer(vocab_file='path/to/your/vocabulary.txt')  # e.g. vocab.txt
  • Enjoy πŸ€—

πŸ”¬ Basic Usage Examples

  1. Simple Tokenization:

    text = "Hello Shirin,  How are you?"
    tokens = tokenizer.tokenize(text)
    print("Tokens:", tokens) #ouput: Tokens: ['Hello', 'Shirin', ',', 'How', 'are', 'you', '?']
    
  2. Including Special Tokens:

    text = "<s>Hello, world!<end_of_text>"
    tokens = tokenizer.tokenize(text, add_special_tokens=True)
    print(tokens) #output: ['<s>', 'Hello', ',', 'world', '!', '<end_of_text>']
    
  3. Token-ID Conversion:

    tokens = ['the', 'weather', 'IS', 'Sunny', '!']
    token_ids = [tokenizer.convert_tokens_to_ids(token.lower()) for token in tokens]  # lower() because our vocab.txt is all lowercase
    print(token_ids)  # output: [1997, 4634, 2004, 11560, 1000]
    
  4. ID-Token Conversion:

    ids = [1, 24707, 4634, 19238, 1000, 31000]
    tokens = [tokenizer.convert_ids_to_tokens(token_id) for token_id in ids]
    print(tokens)  # output: ['[PAD]', 'cloudy', 'weather', 'sucks', '!', '[UNK]']
    
  5. Tokenizing a Long Text:

    long_text = "In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort."
    long_tokens = tokenizer.tokenize(long_text)
    print("Tokens:", long_tokens) #output: ['In', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit', '.', 'Not', 'a', 'nasty', ',', 'dirty', ',', 'wet', 'hole', ',', 'filled', 'with', 'the', 'ends', 'of', 'worms', 'and', 'an', 'oozy', 'smell', ',', 'nor', 'yet', 'a', 'dry', ',', 'bare', ',', 'sandy', 'hole', 'with', 'nothing', 'in', 'it', 'to', 'sit', 'down', 'on', 'or', 'to', 'eat', ':', 'it', 'was', 'a', 'hobbit-hole', ',', 'and', 'that', 'means', 'comfort', '.']
    
  6. Tokenizing Sentences with Emojis:

    text_with_emoji = "I love pizza πŸ•! Do you like it too?"
    tokens_with_emoji = tokenizer.tokenize(text_with_emoji)
    print("Tokens:", tokens_with_emoji) #output: ['I', 'love', 'pizza', 'πŸ•', '!', 'Do', 'you', 'like', 'it', 'too', '?']
    
  7. Saving the Tokenizer:

Save the tokenizer's state, including its vocabulary:

tokenizer.save_vocabulary(save_directory='path/to/save')
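
If you later want to reload the tokenizer from that directory, you can point a new NlktTokenizer at the saved file; the file name below (vocab.txt) is an assumption about what save_vocabulary writes in this repository:

# Re-create the tokenizer from the saved vocabulary file (name assumed to be vocab.txt)
tokenizer = NlktTokenizer(vocab_file='path/to/save/vocab.txt')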

πŸ§ͺ Evaluation using Pytest

We have tested the tokenizer comprehensively with pytest, covering a range of input scenarios to check its robustness and functionality. You can run the test suite yourself with:

pytest test_tokenization_nltk.py
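
If you want to add cases of your own, a test in the same spirit could look like the sketch below; the fixture path and test names are illustrative and not necessarily those used in test_tokenization_nltk.py:

import pytest

from tokenization_nltk import NlktTokenizer

@pytest.fixture
def tokenizer():
    # Path is illustrative; point it at your own vocab.txt
    return NlktTokenizer(vocab_file="vocab.txt")

def test_punctuation_is_split(tokenizer):
    assert tokenizer.tokenize("Hello, world!") == ["Hello", ",", "world", "!"]

def test_unknown_token_maps_to_unk(tokenizer):
    assert tokenizer.convert_tokens_to_ids("definitelynotinvocab") == tokenizer.unk_token_id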

⚠️ Limitations

  • Contextual understanding: The biggest concern with NLTK's tokenization is that it operates mainly at the word level, so it cannot make the sub-word or character-level tokenization decisions that some NLP tasks require.
  • Language Complexity: NLTK may struggle to tokenize languages with complex morphologies or those requiring specialized tokenization rules; for instance, languages that rely heavily on context, like some forms of Chinese or Japanese.
  • Out-of-Vocabulary Words: If the tokenizer encounters words not present in its vocabulary (like id 31000 in the ID-token example above), it falls back to the [UNK] (unknown) token or may handle them poorly, which can hurt downstream task performance (see the short check after this list).
  • Limited Preprocessing: Emojis are not fully supported.
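
As a quick check of the out-of-vocabulary behaviour described above (the word used here is made up and assumed to be absent from vocab.txt):

# An id outside the vocabulary comes back as the unknown token, as in the ID-token example above
print(tokenizer.convert_ids_to_tokens(31000))  # expected: '[UNK]'

# A made-up word maps onto the unknown token id
print(tokenizer.convert_tokens_to_ids("flibbertigibbetish") == tokenizer.unk_token_id)  # expected: True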

πŸ€— Hub Integration

Make sure your vocabulary file (vocab.txt) is in the same directory as the project.

  1. Simple Tokenization:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ShirinYamani/task", trust_remote_code=True)
text = "Example sentence for tokenization."
# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
  2. Including Special Tokens:

    text = "<s>Hello, world!<end_of_text>"
    tokens = tokenizer.tokenize(text, add_special_tokens=True)
    print(tokens) #output: ['<s>', 'Hello', ',', 'world', '!', '<end_of_text>']
    
  3. Token-ID Conversion:

    tokens = ['the', 'weather', 'IS', 'Sunny', '!']
    token_ids = [tokenizer.convert_tokens_to_ids(token.lower()) for token in tokens]  # lower() because our vocab.txt is all lowercase
    print(token_ids)  # output: [1997, 4634, 2004, 11560, 1000]
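
Since NlktTokenizer extends PreTrainedTokenizer, the standard encoding call should also be available once the tokenizer is loaded from the Hub. The snippet below is a sketch of that, assuming PyTorch is installed; return_tensors comes from the base class, not from anything specific to this repository:

# Full encoding via the base PreTrainedTokenizer machinery: tokens -> ids -> PyTorch tensors
encoded = tokenizer("Example sentence for tokenization.", return_tensors="pt")
print(encoded["input_ids"])
print(encoded["attention_mask"])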
    