|
--- |
|
license: mit |
|
language: |
|
- ja |
|
- ko |
|
pipeline_tag: translation |
|
inference: false |
|
--- |
|
|
|
# Japanese-to-Korean translator
|
|
|
A Japanese-to-Korean translation model based on [EncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/encoder-decoder), which combines [bert-japanese](https://huggingface.co/cl-tohoku/bert-base-japanese) as the encoder and [kogpt2](https://github.com/SKT-AI/KoGPT2) as the decoder.
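
For reference, the sketch below shows how this kind of encoder-decoder pair can be assembled with the `transformers` API. It is a minimal illustration of the architecture only, not the exact configuration used to fine-tune this model.

```python
from transformers import EncoderDecoderModel

# Minimal sketch: combine a Japanese BERT encoder with a Korean GPT-2 decoder.
# This illustrates the architecture only; it is not the exact fine-tuning
# setup used to train this model.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "cl-tohoku/bert-base-japanese-v2",  # encoder
    "skt/kogpt2-base-v2",               # decoder
)
```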
|
|
|
# Usage |
|
## Demo |
|
Please visit the [demo Space](https://huggingface.co/spaces/sappho192/aihub-ja-ko-translator-demo) to try the model.
|
|
|
## Dependencies (PyPI) |
|
|
|
- torch |
|
- transformers |
|
- fugashi |
|
- unidic-lite |
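
They can be installed with pip, for example:

```sh
pip install torch transformers fugashi unidic-lite
```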
|
|
|
## Inference |
|
|
|
```python
from transformers import (
    EncoderDecoderModel,
    PreTrainedTokenizerFast,
    BertJapaneseTokenizer,
)

encoder_model_name = "cl-tohoku/bert-base-japanese-v2"
decoder_model_name = "skt/kogpt2-base-v2"

# Tokenizers: Japanese BERT for the source text, KoGPT2 for the target text
src_tokenizer = BertJapaneseTokenizer.from_pretrained(encoder_model_name)
trg_tokenizer = PreTrainedTokenizerFast.from_pretrained(decoder_model_name)

model = EncoderDecoderModel.from_pretrained("sappho192/aihub-ja-ko-translator")

text = "初めまして。よろしくお願いします。"


def translate(text_src):
    # Tokenize the Japanese source sentence into encoder inputs
    embeddings = src_tokenizer(text_src, return_attention_mask=False,
                               return_token_type_ids=False, return_tensors='pt')
    # Generate the Korean translation, then drop the leading BOS and trailing EOS tokens
    output = model.generate(**embeddings, max_length=500)[0, 1:-1]
    text_trg = trg_tokenizer.decode(output.cpu())
    return text_trg


print(translate(text))
```
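
The example above runs on CPU. If a GPU is available, the model and inputs can be moved to it first; the sketch below (with a hypothetical `translate_on_gpu` helper, not part of the original example) shows one way to do that:

```python
import torch

# Optional GPU inference: an illustrative sketch, not part of the original example.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


def translate_on_gpu(text_src):
    embeddings = src_tokenizer(text_src, return_attention_mask=False,
                               return_token_type_ids=False, return_tensors='pt')
    # Move the input tensors to the same device as the model
    embeddings = {k: v.to(device) for k, v in embeddings.items()}
    output = model.generate(**embeddings, max_length=500)[0, 1:-1]
    # Bring the generated token ids back to the CPU before decoding
    return trg_tokenizer.decode(output.cpu())


print(translate_on_gpu(text))
```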
|
|
|
# Dataset |
|
|
|
This model was trained on datasets from 'The Open AI Dataset Project (AI-Hub, South Korea)'.
|
All data information can be accessed through 'AI-Hub ([aihub.or.kr](https://www.aihub.or.kr))'. |
|
(**In order for a corporation, organization, or individual located outside of Korea to use the AI data, etc., a separate agreement is required** with the performing organization and the Korea National Information Society Agency (NIA). Exporting the AI data, etc. outside of Korea likewise requires a separate agreement with the performing organization and the NIA. [Link](https://aihub.or.kr/intrcn/guid/usagepolicy.do?currMenu=151&topMenu=105))
|
|
|
This model is the result of research conducted using a dataset built with support from the National Information Society Agency (NIA), funded by the Korean Ministry of Science and ICT.

The data used for this model can be downloaded from AI-Hub ([aihub.or.kr](https://www.aihub.or.kr)).
|
|
|
## Dataset list |
|
|
|
The training dataset was created by merging the following sub-datasets:
|
|
|
- 027. 일상생활 및 구어체 한-중, 한-일 번역 병렬 말뭉치 데이터 [[Link](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=546)]

- 053. 한국어-다국어(영어 제외) 번역 말뭉치(기술과학) [[Link](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71493)]

- 054. 한국어-다국어 번역 말뭉치(기초과학) [[Link](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71496)]

- 055. 한국어-다국어 번역 말뭉치(인문학) [[Link](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71498)]

- 한국어-일본어 번역 말뭉치 [[Link](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=127)]
|
|
|
To reproduce the merged dataset, you can use the code at the link below:
|
https://github.com/sappho192/aihub-translation-dataset |
|
|
|
|