---
datasets:
- PrompTart/PTT_advanced_en_ko
language:
- en
- ko
base_model:
- facebook/m2m100_418M
---

# M2M100 Fine-Tuned on Parenthetical Terminology Translation (PTT) Dataset

## Model Overview

This is an **M2M100** model fine-tuned on the [**Parenthetical Terminology Translation (PTT)**](https://huggingface.co/datasets/PrompTart/PTT_advanced_en_ko) dataset. The PTT dataset focuses on translating technical terms accurately by placing the original English term in parentheses alongside its Korean translation, enhancing clarity and precision in specialized fields. This fine-tuned model is optimized for handling technical terminology in the **Artificial Intelligence (AI)** domain.

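To illustrate the output format, the pair below (taken directly from the usage example later in this card) shows how the Korean translation keeps the original English term in parentheses:

```python
# Source sentence (English) and its PTT-style translation (Korean),
# reproduced from the usage example below.
source = "The model was fine-tuned using knowledge distillation techniques."
target = "이 모델은 지식 증류 기법(knowledge distillation techniques)을 사용하여 미세 조정되었습니다."
```
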
## Intended Use

This model is designed for tasks that involve:
- **Technical Term Translation**: Especially helpful for domains with frequent use of specialized terminology (e.g., AI).
- **Parenthetical Terminology Translation (PTT)**: Retaining the original English term in parentheses to support reader comprehension and reduce ambiguity.

It is suitable for machine translation applications where technical term accuracy is critical, including **academic research translation**, **technical documentation**, and **domain-specific machine translation**.

## Model Details

- **Base Model**: [facebook/m2m100_418M](https://huggingface.co/facebook/m2m100_418M)
- **Training Data**: Parenthetical Terminology Translation (PTT) dataset
- **Languages**: English to Korean
- **Domain**: Artificial Intelligence (AI)

## Example Usage

Here’s how to use this fine-tuned model with the Hugging Face `transformers` library:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "PrompTart/m2m100_418M_PTT_en_ko"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Example sentence
text = "The model was fine-tuned using knowledge distillation techniques."

# Tokenize and generate translation
tokenizer.src_lang = "en"
encoded = tokenizer(text, return_tensors="pt")
generated_tokens = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("ko"))
translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(translation)
# => "이 모델은 지식 증류 기법(knowledge distillation techniques)을 사용하여 미세 조정되었습니다."
```

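If you need to translate several sentences at once, a minimal batch-processing sketch might look like the following. It reuses the `model` and `tokenizer` objects from above and relies only on standard `transformers` padding and generation behavior; the second example sentence is illustrative only.

```python
# Hypothetical batch translation, reusing the model and tokenizer loaded above.
sentences = [
    "The model was fine-tuned using knowledge distillation techniques.",
    "Transformer models rely on self-attention to capture long-range context.",  # illustrative example
]

tokenizer.src_lang = "en"
encoded = tokenizer(sentences, return_tensors="pt", padding=True)
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("ko"))
translations = tokenizer.batch_decode(generated, skip_special_tokens=True)

for source, translation in zip(sentences, translations):
    print(f"{source}\n-> {translation}\n")
```
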
## Limitations

- **Out-of-Domain Accuracy**: While the model generalizes to some extent, accuracy may vary in domains that were not part of the training set (such as specialized biology or physics terms not included in PTT).

## Citation

If you use this model in your research, please cite the original dataset and paper:

```tex
@misc{myung2024efficienttechnicaltermtranslation,
  title={Efficient Technical Term Translation: A Knowledge Distillation Approach for Parenthetical Terminology Translation},
  author={Jiyoon Myung and Jihyeon Park and Jungki Son and Kyungro Lee and Joohyung Han},
  year={2024},
  eprint={2410.00683},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.00683},
}
```

## Contact

For questions or feedback, please contact [jiyoon0424@gmail.com](mailto:jiyoon0424@gmail.com).