---
license: cc-by-nc-sa-4.0
language:
- ko
- en
metrics:
- bleu
pipeline_tag: text2text-generation
tags:
- nmt
- aihub
---

# KOEN-T5-SMALL-V0

This model is a Korean-to-English machine translation model based on the T5-small architecture, trained from scratch.
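
A minimal usage sketch with 🤗 Transformers is shown below; the repo id is a placeholder assumption (substitute the actual Hub path of this model), and the generation settings are illustrative.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder repo id; replace with the actual Hub path of this model.
model_id = "kh-kim/koen-t5-small-v0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Translate a Korean sentence into English.
text = "안녕하세요, 만나서 반갑습니다."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```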

#### Code

The training code is from my lecture, [LLM을 위한 김기현의 NLP EXPRESS](https://fastcampus.co.kr/data_online_nlpexpress), published on [FastCampus](https://fastcampus.co.kr/). You can find the training code in this GitHub [repo](https://github.com/kh-kim/nlp-express-practice).

#### Dataset

The training dataset for this model is mainly from [AI-Hub](https://www.aihub.or.kr/). It consists of 11M Korean-English parallel samples.

#### Tokenizer

I use a byte-level BPE tokenizer for both the source and target languages. Since it covers both Korean and English, the tokenizer vocabulary size is 60k.
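
A sketch of how such a shared tokenizer could be trained with the Hugging Face `tokenizers` library; the corpus file names and special tokens are illustrative assumptions, not the exact training setup.

```python
from tokenizers import ByteLevelBPETokenizer

# Train one byte-level BPE tokenizer on both languages so that a single
# 60k vocabulary covers Korean and English.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.ko.txt", "corpus.en.txt"],  # hypothetical corpus files
    vocab_size=60000,
    min_frequency=2,
    special_tokens=["<pad>", "</s>", "<unk>"],  # assumed T5-style specials
)
tokenizer.save_model("tokenizer")  # writes vocab.json and merges.txt
```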

#### Architecture

The model architecture is based on T5-small, a popular encoder-decoder architecture. Please note that this model is trained from scratch, not fine-tuned.
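
For reference, a from-scratch T5-small can be instantiated as a randomly initialized `T5ForConditionalGeneration`; the sketch below assumes the 60k shared vocabulary from the tokenizer section, while the remaining hyperparameters are the standard T5-small values.

```python
from transformers import T5Config, T5ForConditionalGeneration

# T5-small hyperparameters; only vocab_size differs from the library defaults.
config = T5Config(
    vocab_size=60000,  # shared Korean-English BPE vocabulary
    d_model=512,
    d_ff=2048,
    num_layers=6,
    num_heads=8,
)
model = T5ForConditionalGeneration(config)  # random weights, i.e. from scratch
print(f"{model.num_parameters():,} parameters")
```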

#### Evaluation

I evaluated the model on 5 different test sets. The following figures show the BLEU scores on each test set.

![BLEU scores](./assets/BLEU%20Scores.png)

![BLEU scores by test set](./assets/BLEU%20Scores%20by%20Testset.png)

The DEEPCL model is a private version of this model, trained on much more data.
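
A sketch of computing corpus-level BLEU with [sacrebleu](https://github.com/mjpost/sacrebleu); the sentences below are made-up placeholders, and sacrebleu is my tool choice for illustration, not necessarily the one behind the figures.

```python
import sacrebleu

# Hypothetical system outputs and aligned reference translations.
hypotheses = ["I am going to school.", "It is sunny today."]
references = [["I am going to school.", "The weather is sunny today."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```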

#### Contact

Kim Ki Hyun (nlp.with.deep.learning@gmail.com)