---
datasets:
- PrompTart/PTT_advanced_en_ko
language:
- en
- ko
base_model:
- facebook/m2m100_418M
library_name: transformers
---
# M2M100 Fine-Tuned on Parenthetical Terminology Translation (PTT) Dataset
## Model Overview
This is an **M2M100** model fine-tuned on the [**Parenthetical Terminology Translation (PTT)**](https://arxiv.org/abs/2410.00683) dataset. [The PTT dataset](https://huggingface.co/datasets/PrompTart/PTT_advanced_en_ko) focuses on translating technical terms accurately by placing the original English term in parentheses alongside its Korean translation, enhancing clarity and precision in specialized fields. This fine-tuned model is optimized for handling technical terminology in the **Artificial Intelligence (AI)** domain.
## Example Usage
Here's how to use this fine-tuned model with the Hugging Face `transformers` library:

<span style="color:red">*Note:*</span> `M2M100Tokenizer` depends on <span style="color:blue">sentencepiece</span>, so make sure to install it before running the example: `pip install sentencepiece`.
```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
model_name = "PrompTart/m2m100_418M_PTT_en_ko"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)
# Example sentences (joined with ". " so they can be split into a small batch below)
text = ("The model was fine-tuned using knowledge distillation techniques. "
        "The training dataset was created using a collaborative multi-agent framework powered by large language models.")
# Tokenize and generate translation
tokenizer.src_lang = "en"
encoded = tokenizer(text.split('. '), return_tensors="pt", padding=True)
generated_tokens = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("ko"))
outputs = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print('\n'.join(outputs))
# => "์ด ๋ชจ๋ธ์ ์ง์ ์ฆ๋ฅ ๊ธฐ๋ฒ(knowledge distillation techniques)์ ์ฌ์ฉํ์ฌ ๋ฏธ์ธ ์กฐ์ ๋์์ต๋๋ค.
# ํ๋ จ ๋ฐ์ดํฐ์
(training dataset)์ ๋ํ ์ธ์ด ๋ชจ๋ธ(large language models)์ ๊ธฐ๋ฐ์ผ๋ก ํ ํ์
๋ค์ค ์์ด์ ํธ ํ๋ ์์ํฌ(collaborative multi-agent framework)๋ฅผ ์ฌ์ฉํ์ฌ ์์ฑ๋์์ต๋๋ค."
```
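If you prefer the high-level API, the model can also be loaded through the `transformers` translation `pipeline`; the snippet below is a minimal sketch (the example sentence is illustrative and not taken from the dataset):

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as an English-to-Korean translation pipeline
translator = pipeline(
    "translation",
    model="PrompTart/m2m100_418M_PTT_en_ko",
    src_lang="en",
    tgt_lang="ko",
)

# The pipeline returns a list of dicts with a "translation_text" field
result = translator("The encoder was pretrained with a masked language modeling objective.")
print(result[0]["translation_text"])
```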
## Limitations
- **Out-of-Domain Accuracy**: While the model generalizes to some extent, accuracy may vary in domains that were not part of the training set.
- **Incomplete Parenthetical Annotation**: Not all technical terms are consistently displayed in parentheses; in some cases, terms may be omitted or not annotated as expected (see the post-check sketch below this list).
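If consistent annotation matters for your application, one pragmatic workaround is to post-check each translation against the source terms you expect to see annotated. The helper below is a hypothetical sketch (the function name, regex, and example terms are illustrative assumptions, not part of the model or dataset):

```python
import re

def missing_parentheticals(translation: str, expected_terms: list[str]) -> list[str]:
    """Return the expected English terms that were not annotated in parentheses."""
    # Gather every parenthetical span emitted by the model and compare case-insensitively
    annotated = " ".join(re.findall(r"\(([^)]*)\)", translation)).lower()
    return [term for term in expected_terms if term.lower() not in annotated]

# Hypothetical usage: flag a translation where one expected term was left unannotated
print(missing_parentheticals(
    "이 모델은 지식 증류 기법(knowledge distillation)을 사용하여 미세 조정되었습니다.",
    ["knowledge distillation", "multi-agent framework"],
))
# => ['multi-agent framework']
```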
## Citation
If you use this model in your research, please cite the original dataset and paper:
```tex
@misc{myung2024efficienttechnicaltermtranslation,
title={Efficient Technical Term Translation: A Knowledge Distillation Approach for Parenthetical Terminology Translation},
author={Jiyoon Myung and Jihyeon Park and Jungki Son and Kyungro Lee and Joohyung Han},
year={2024},
eprint={2410.00683},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.00683},
}
```
## Contact
For questions or feedback, please contact [jiyoon0424@gmail.com](mailto:jiyoon0424@gmail.com). |