---
language:
- en
- ja
tags:
- nllb
license: cc-by-nc-4.0
---

# NLLB 1.3B fine-tuned on Japanese to English Light Novel translation

This model was fine-tuned on light novels and web novels for Japanese-to-English translation.

It can translate sentences and paragraphs of up to 512 tokens.
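
Text longer than the 512-token limit has to be split before it is translated. The following is a minimal sketch of one way to do that, not something shipped with the model: `split_into_chunks` and `count_tokens` are hypothetical names, and the sketch assumes sentences end with `。`, `！`, or `？`.

```python
import re
from typing import Callable, List

# Japanese sentence-ending punctuation; an assumption about the input text.
_SENTENCE_END = re.compile(r"(?<=[。！？])")


def split_into_chunks(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens.

    A single sentence longer than max_tokens is kept as its own chunk
    rather than being cut mid-sentence.
    """
    sentences = [s for s in _SENTENCE_END.split(text) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = current + sentence
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With the real tokenizer, the counter could be `lambda s: len(tokenizer(s)["input_ids"])`, which also counts the special tokens the tokenizer adds.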

## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("thefrigidliquidation/nllb-jaen-1.3B-lightnovels")
model = AutoModelForSeq2SeqLM.from_pretrained("thefrigidliquidation/nllb-jaen-1.3B-lightnovels")

# Japanese source text to translate (illustrative sentence)
text = "マイン、ルッツが迎えに来たよ"
inputs = tokenizer(text, return_tensors="pt")

generated_tokens = model.generate(
    **inputs,
    # convert_tokens_to_ids is used in place of tokenizer.lang_code_to_id,
    # which is deprecated and missing from recent transformers releases
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(tokenizer.tgt_lang),
    max_new_tokens=1024,
    no_repeat_ngram_size=6,
).cpu()

translated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
```

Generating with diverse beam search seems to work best. Add the following arguments to the `model.generate` call:
```python
num_beams=8,
num_beam_groups=4,
do_sample=False,
```

## Glossary
You can provide up to 10 custom translations for nouns and character names at runtime. To do so, wrap the Japanese term in term tokens: prefix it with one of `<t0>, <t1>, ..., <t9>` and suffix it with `</t>`. The model outputs the prefix term token in place of the term, and you can then swap that token for your preferred translation with a string replacement.

For example, to have `マイン` translated as `Myne` in `マイン、ルッツが迎えに来たよ`, replace `マイン` with `<t0>マイン</t>`. The model will translate `<t0>マイン</t>、ルッツが迎えに来たよ` as `<t0>, Lutz is here to pick you up.` Then do a string replacement on the output, replacing `<t0>` with `Myne`.
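
The tagging and replacement steps can be sketched as two small helpers. This is a minimal illustration, not part of the model's API: `tag_terms` and `restore_terms` are hypothetical names, and the sketch assumes plain substring matching with glossary terms that do not overlap one another.

```python
from typing import Dict, List, Tuple

MAX_TERMS = 10  # term tokens run from <t0> to <t9>


def tag_terms(text: str, glossary: Dict[str, str]) -> Tuple[str, List[str]]:
    """Wrap each glossary term found in `text` as <tN>term</t>.

    Returns the tagged text and the English replacements, indexed by the
    term token number N that was assigned to each term.
    """
    if len(glossary) > MAX_TERMS:
        raise ValueError(f"at most {MAX_TERMS} glossary terms are supported")
    replacements: List[str] = []
    for term, english in glossary.items():
        if term not in text:
            continue  # unused terms do not consume a term token
        index = len(replacements)
        text = text.replace(term, f"<t{index}>{term}</t>")
        replacements.append(english)
    return text, replacements


def restore_terms(translation: str, replacements: List[str]) -> str:
    """Replace each <tN> token in the model output with its English term."""
    for index, english in enumerate(replacements):
        translation = translation.replace(f"<t{index}>", english)
    return translation


tagged, replacements = tag_terms("マイン、ルッツが迎えに来たよ", {"マイン": "Myne"})
# tagged is "<t0>マイン</t>、ルッツが迎えに来たよ"; pass it to the model.
# Applied to the model output quoted in the example:
print(restore_terms("<t0>, Lutz is here to pick you up.", replacements))
# prints: Myne, Lutz is here to pick you up.
```

Returning the replacement list alongside the tagged text keeps the token numbering and the English terms in sync, so the same list can be reused for every output of a batch tagged with one glossary.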

## Honorifics
You can encourage or discourage the model's use of honorifics by setting `tokenizer.tgt_lang` before generating. In this fine-tune the three language codes act as mode switches; the translation is English in every case.

```python
# default, the model decides whether to use honorifics
tokenizer.tgt_lang = "jpn_Jpan"
# no honorifics, the model is discouraged from using honorifics
tokenizer.tgt_lang = "zsm_Latn"
# honorifics, the model is encouraged to use honorifics
tokenizer.tgt_lang = "zul_Latn"
```
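
The three settings can be wrapped in a small lookup so that calling code names the behavior instead of the language code. This is a sketch under assumptions: `HONORIFIC_MODES` and `set_honorific_mode` are hypothetical names, and the helper only assigns `tgt_lang`, so the `forced_bos_token_id` passed to `model.generate` has to be computed after calling it.

```python
# Mode names are illustrative; the language codes are the ones listed above.
HONORIFIC_MODES = {
    "auto": "jpn_Jpan",  # default, the model decides
    "off": "zsm_Latn",   # honorifics discouraged
    "on": "zul_Latn",    # honorifics encouraged
}


def set_honorific_mode(tokenizer, mode: str) -> str:
    """Set tokenizer.tgt_lang for the given mode and return the code used."""
    try:
        code = HONORIFIC_MODES[mode]
    except KeyError:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(HONORIFIC_MODES)}"
        ) from None
    tokenizer.tgt_lang = code
    return code
```

For example, `set_honorific_mode(tokenizer, "off")` would be called once before translating a text in which honorifics should be dropped.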