A BertTokenizer-based tokenizer that segments Chinese/Cantonese sentences into phrases.
In addition to the 51,271 tokens inherited from the base tokenizer, 194,020 Chinese words have been added to the vocabulary.
Usage:
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('raptorkwok/wordseg-tokenizer')
```
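As a quick sanity check, the total vocabulary size should match the counts above, i.e. 51,271 + 194,020 = 245,291 (this expected total is derived from the description, not independently verified against the model files):

```python
# Total vocabulary size: 51,271 base tokens + 194,020 added
# Chinese words = 245,291 (derived from the description above).
print(len(tokenizer))  # expected: 245291
```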
Examples:

Cantonese Example 1:
```python
tokenizer.tokenize("我哋今日去睇陳奕迅演唱會")
# Output: ['我哋', '今日', '去', '睇', '陳奕迅', '演唱會']
```
Cantonese Example 2:
```python
tokenizer.tokenize("再嘈我打爆你個嘴!")
# Output: ['再', '嘈', '我', '打爆', '你', '個', '嘴', '!']
```
Chinese Example 1:
```python
tokenizer.tokenize("你很肥胖呢,要開始減肥了。")
# Output: ['你', '很', '肥胖', '呢', ',', '要', '開始', '減肥', '了', '。']
```
Chinese Example 2:
```python
tokenizer.tokenize("案件現由大嶼山警區重案組接手調查。")
# Output: ['案件', '現', '由', '大嶼山', '警區', '重案組', '接手', '調查', '。']
```
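The segmented output plugs into the standard BertTokenizer workflow; a minimal sketch (the printed ids depend on the extended vocabulary and are not reproduced here):

```python
# Map segmented words to vocabulary ids; words covered by the added
# Chinese vocabulary map to single ids rather than subword pieces.
tokens = tokenizer.tokenize("案件現由大嶼山警區重案組接手調查。")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

# encode() additionally wraps the sequence in [CLS] ... [SEP].
print(tokenizer.encode("我哋今日去睇陳奕迅演唱會"))
```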
Questions?
Please feel free to leave a message in the Community tab.
Base model: fnlp/bart-base-chinese