Different encoding for identical looking words (for East Asian Languages)

#2
by tavanaei - opened

If you do:
tokenizer(['기하학에 대한 노트', '기하학에 대한 노트'])

The output will be:
{'input_ids': [[101, 12133, 80207, 50013, 15492, 17534, 12265, 22903, 102], [101, 100, 100, 100, 102]] ...}

The two sentences look exactly the same (and Google Translate renders them the same). By the way, 'bert-base-uncased' and other models don't have this issue.
What could the problem be? The initial encoding of these characters?

Sorry for the delayed reply here.

As best I can tell, an encoding difference is causing the discrepancy. Looking at the second tokenized sequence ([101, 100, 100, 100, 102]), the 100s map to [UNK], which tells me those pieces weren't found in the vocab and further hints at an encoding issue.
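A quick way to confirm that is to look up id 100 in the tokenizer's vocab. A minimal sketch, assuming the LaBSE checkpoint from this repo (any BERT-style tokenizer exposes the same attributes):

```python
from transformers import AutoTokenizer

# Assumed model id; substitute whichever checkpoint produced the ids above.
tokenizer = AutoTokenizer.from_pretrained("setu4993/LaBSE")

# Map id 100 back to its token, and print the tokenizer's unknown token.
print(tokenizer.convert_ids_to_tokens(100))         # expected: '[UNK]'
print(tokenizer.unk_token, tokenizer.unk_token_id)  # expected: '[UNK]' 100
```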

setu4993 changed discussion status to closed

@tavanaei there seems to be some difference between the text 기하학에 대한 노트 (1st string) and 기하학에 대한 노트 (2nd string). It may not be visible to the human eye, but when I copy-pasted tokenizer(['기하학에 대한 노트', '기하학에 대한 노트']) from your initial question and performed a simple string match, as attached in the screenshot, the two strings came out different. That is why you're getting different encodings.
[Screenshot attached: Screenshot 2023-10-09 at 12.26.07 PM.png]
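A common cause of this kind of invisible difference in Korean text is Unicode normalization: the same syllables can be stored as precomposed characters (NFC) or as decomposed jamo (NFD), which render identically but compare unequal and hit the vocab differently. A minimal sketch of that scenario, with the decomposed string constructed explicitly as a stand-in for the second string (the actual bytes from the question can't be verified here):

```python
import unicodedata

# Precomposed (NFC) form of the string from the question.
a = "기하학에 대한 노트"
# Decomposed (NFD) form: looks the same, but uses different codepoints.
b = unicodedata.normalize("NFD", a)

print(a == b)                        # False: the codepoints differ
print([hex(ord(c)) for c in a[:2]])  # precomposed syllables, e.g. ['0xae30', '0xd558']
print([hex(ord(c)) for c in b[:2]])  # decomposed jamo, e.g. ['0x1100', '0x1175']

# Normalizing both to NFC makes them identical again.
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True
```

If that is indeed the cause, normalizing inputs with unicodedata.normalize('NFC', text) before calling the tokenizer should make both strings encode identically.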

@DeathReaper0965 : Nice validation, thanks for checking!
