I had the same issue as in #4. After some investigation, I found that source-side tokenization was failing. For example:
```
Test Sentence:
๋‹ค์Œ์€ ์ด ๋ฌธ์ œ๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ๊ฐ„๋‹จํ•œ ํ•œ๊ตญ์–ด ํ…Œ์ŠคํŠธ ๋ฌธ์žฅ์ž…๋‹ˆ๋‹ค.

Tokenized through MarianTokenizer:
[0, 11556, 0, 0, 0, 0, 0, 0, 0, 3, 2]

Tokenized through underlying sentencepiece model:
[5773, 10, 1138, 5421, 2027, 3780, 1064, 7888, 62, 3]
```


The tokenizer assigns id=0 (the unknown token) to most tokens, which is unexpected, while calling the underlying sentencepiece model directly (`tokenizer.spm_source.encode(sentence)`) gives a sensible result. The culprit seems to be [this piece of code](https://github.com/huggingface/transformers/blob/cbe58b4269457a6ca66a556224b23f9ef246f905/src/transformers/models/marian/tokenization_marian.py#L205), which translates the (string) tokens into ids via the vocabulary file: the entries in that vocabulary do not correspond to the pieces in the sentencepiece model, so most tokens end up as 0 / `<unk>`.
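
For anyone who wants to reproduce the comparison above, a minimal sketch (assuming the model was downloaded to `./opus-mt-tc-big-ko-en`, as in the snippets below):

```python
import transformers

sentence = "๋‹ค์Œ์€ ์ด ๋ฌธ์ œ๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ๊ฐ„๋‹จํ•œ ํ•œ๊ตญ์–ด ํ…Œ์ŠคํŠธ ๋ฌธ์žฅ์ž…๋‹ˆ๋‹ค."
tokenizer = transformers.MarianTokenizer.from_pretrained('./opus-mt-tc-big-ko-en')

# Ids via the tokenizer, i.e. looked up in vocab.json (mostly 0 / <unk> before the fix)
print(tokenizer(sentence)["input_ids"])

# Ids straight from the underlying sentencepiece model
print(tokenizer.spm_source.encode(sentence))
```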

To fix the issue, I generated the correct vocabularies as follows (inspired by this snippet on [GitHub](https://github.com/google/sentencepiece/issues/668#issuecomment-873527039)):
```python
import transformers
import json

tokenizer = transformers.MarianTokenizer.from_pretrained('./opus-mt-tc-big-ko-en')
vocab = { tokenizer.spm_source.id_to_piece(id): id for id in range(tokenizer.spm_source.get_piece_size()) }
vocab[tokenizer.pad_token] = tokenizer.pad_token_id

with open("opus-mt-tc-big-ko-en/vocab.json", "w") as f:
    json.dump(vocab, f, indent=2)

target_vocab = { tokenizer.spm_target.id_to_piece(id): id for id in range(tokenizer.spm_target.get_piece_size()) }
target_vocab[tokenizer.pad_token] = tokenizer.pad_token_id

with open("opus-mt-tc-big-ko-en/target_vocab.json", "w") as f:
    json.dump(target_vocab, f, indent=2)

and edited the tokenizer config.
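
I won't show the exact config diff here, but as a sketch of what the edit could look like: recent `transformers` versions expose a `separate_vocabs` option on `MarianTokenizer`, and the assumption below is that enabling it is the right switch to make the tokenizer load `vocab.json` for the source side and `target_vocab.json` for the target side.

```python
import json

config_path = "opus-mt-tc-big-ko-en/tokenizer_config.json"

with open(config_path) as f:
    config = json.load(f)

# Assumption: with separate_vocabs enabled, MarianTokenizer reads vocab.json
# for the source vocabulary and target_vocab.json for the target vocabulary.
config["separate_vocabs"] = True

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```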

This now correctly tokenizes our test sentence:
```

Test Sentence:
๋‹ค์Œ์€ ์ด ๋ฌธ์ œ๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ๊ฐ„๋‹จํ•œ ํ•œ๊ตญ์–ด ํ…Œ์ŠคํŠธ ๋ฌธ์žฅ์ž…๋‹ˆ๋‹ค.

Tokenized through MarianTokenizer:
[5773, 10, 1138, 5421, 2027, 3780, 1064, 7888, 62, 3, 2]

Tokenized through underlying sentencepiece model:
[5773, 10, 1138, 5421, 2027, 3780, 1064, 7888, 62, 3]
```


To make sure this also works end-to-end, here's a test I ran:
```python
from transformers import MarianTokenizer, MarianMTModel

sentence = "๋‹ค์Œ์€ ์ด ๋ฌธ์ œ๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ๊ฐ„๋‹จํ•œ ํ•œ๊ตญ์–ด ํ…Œ์ŠคํŠธ ๋ฌธ์žฅ์ž…๋‹ˆ๋‹ค."
tokenizer = MarianTokenizer.from_pretrained('./opus-mt-tc-big-ko-en')
model = MarianMTModel.from_pretrained('./opus-mt-tc-big-ko-en')
translated = model.generate(**tokenizer(sentence, return_tensors="pt", padding=True))

print(tokenizer.decode(translated[0], skip_special_tokens=True))
```

which prints "ses." in the current, buggy version. The fixed one outputs "The following is a simple Korean test sentence that shows this problem." which is the expected translation . (Note: I used deepl to generate the test sentece, figured that would be enough for this test)
