I had the same issue as in #4. After some investigation, I found that source-side tokenization was failing. For example:
```
Test Sentence:
๋‹ค์Œ์€ ์ด ๋ฌธ์ œ๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ๊ฐ„๋‹จํ•œ ํ•œ๊ตญ์–ด ํ…Œ์ŠคํŠธ ๋ฌธ์žฅ์ž…๋‹ˆ๋‹ค.

Tokenized through MarianTokenizer:
[0, 11556, 0, 0, 0, 0, 0, 0, 0, 3, 2]

Tokenized through underlying sentencepiece model:
[5773, 10, 1138, 5421, 2027, 3780, 1064, 7888, 62, 3]
```


The tokenizer assigns id=0 (the unknown token) to most tokens, which is unexpected, while calling the underlying sentencepiece model directly (`tokenizer.spm_source.encode(sentence)`) gives a sensible result. The culprit seems to be [this piece of code](https://github.com/huggingface/transformers/blob/cbe58b4269457a6ca66a556224b23f9ef246f905/src/transformers/models/marian/tokenization_marian.py#L205), which translates the (string) tokens into ids via the vocabulary file: the entries in that vocabulary do not correspond to the pieces in the sentencepiece model, so most tokens end up as 0 / `<unk>`.
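
For anyone who wants to reproduce the comparison above, a minimal sketch (assuming the model was downloaded to `./opus-mt-tc-big-ko-en`, as in the snippets below):

```python
import transformers

sentence = "๋‹ค์Œ์€ ์ด ๋ฌธ์ œ๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ๊ฐ„๋‹จํ•œ ํ•œ๊ตญ์–ด ํ…Œ์ŠคํŠธ ๋ฌธ์žฅ์ž…๋‹ˆ๋‹ค."
tokenizer = transformers.MarianTokenizer.from_pretrained('./opus-mt-tc-big-ko-en')

# Ids via the tokenizer, i.e. looked up in vocab.json (mostly 0 / <unk> before the fix)
print(tokenizer(sentence)["input_ids"])

# Ids straight from the underlying sentencepiece model
print(tokenizer.spm_source.encode(sentence))
```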

To fix the issue, I generated the correct vocabularies as follows (inspired by this snippet on [GitHub](https://github.com/google/sentencepiece/issues/668#issuecomment-873527039)):
```python
import transformers
import json

tokenizer = transformers.MarianTokenizer.from_pretrained('./opus-mt-tc-big-ko-en')
vocab = { tokenizer.spm_source.id_to_piece(id): id for id in range(tokenizer.spm_source.get_piece_size()) }
vocab[tokenizer.pad_token] = tokenizer.pad_token_id

with open("opus-mt-tc-big-ko-en/vocab.json", "w") as f:
    json.dump(vocab, f, indent=2)

target_vocab = { tokenizer.spm_target.id_to_piece(id): id for id in range(tokenizer.spm_target.get_piece_size()) }
target_vocab[tokenizer.pad_token] = tokenizer.pad_token_id

with open("opus-mt-tc-big-ko-en/target_vocab.json", "w") as f:
    json.dump(target_vocab, f, indent=2)

and edited the tokenizer config.
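
I won't show the exact config diff here, but as a sketch of what the edit could look like: recent `transformers` versions expose a `separate_vocabs` option on `MarianTokenizer`, and the assumption below is that enabling it is the right switch to make the tokenizer load `vocab.json` for the source side and `target_vocab.json` for the target side.

```python
import json

config_path = "opus-mt-tc-big-ko-en/tokenizer_config.json"

with open(config_path) as f:
    config = json.load(f)

# Assumption: with separate_vocabs enabled, MarianTokenizer reads vocab.json
# for the source vocabulary and target_vocab.json for the target vocabulary.
config["separate_vocabs"] = True

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```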

This now correctly tokenizes our test sentence:
```

Test Sentence:
๋‹ค์Œ์€ ์ด ๋ฌธ์ œ๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ๊ฐ„๋‹จํ•œ ํ•œ๊ตญ์–ด ํ…Œ์ŠคํŠธ ๋ฌธ์žฅ์ž…๋‹ˆ๋‹ค.

Tokenized through MarianTokenizer:
[5773, 10, 1138, 5421, 2027, 3780, 1064, 7888, 62, 3, 2]

Tokenized through underlying sentencepiece model:
[5773, 10, 1138, 5421, 2027, 3780, 1064, 7888, 62, 3]
```


To make sure this also works end-to-end, here's a test I ran:
```python
from transformers import MarianTokenizer, MarianMTModel

sentence = "๋‹ค์Œ์€ ์ด ๋ฌธ์ œ๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ๊ฐ„๋‹จํ•œ ํ•œ๊ตญ์–ด ํ…Œ์ŠคํŠธ ๋ฌธ์žฅ์ž…๋‹ˆ๋‹ค."
tokenizer = MarianTokenizer.from_pretrained('./opus-mt-tc-big-ko-en')
model = MarianMTModel.from_pretrained('./opus-mt-tc-big-ko-en')
translated = model.generate(**tokenizer(sentence, return_tensors="pt", padding=True))

print(tokenizer.decode(translated[0], skip_special_tokens=True))
```

which prints "ses." in the current, buggy version. The fixed one outputs "The following is a simple Korean test sentence that shows this problem." which is the expected translation . (Note: I used deepl to generate the test sentece, figured that would be enough for this test)
