---
datasets:
- PrompTart/PTT_advanced_en_ko
language:
- en
- ko
base_model:
- facebook/m2m100_418M
---

# M2M100 Fine-Tuned on Parenthetical Terminology Translation (PTT) Dataset

## Model Overview

This is an **M2M100** model fine-tuned on the [**Parenthetical Terminology Translation (PTT)**](https://huggingface.co/datasets/PrompTart/PTT_advanced_en_ko) dataset. The PTT dataset focuses on translating technical terms accurately by placing the original English term in parentheses alongside its Korean translation, enhancing clarity and precision in specialized fields. This fine-tuned model is optimized for handling technical terminology in the **Artificial Intelligence (AI)** domain.

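To illustrate the output format, the pair below (taken directly from the usage example later in this card) shows how the Korean translation keeps the original English term in parentheses:

```python
# Source sentence (English) and its PTT-style translation (Korean),
# reproduced from the usage example below.
source = "The model was fine-tuned using knowledge distillation techniques."
target = "이 모델은 지식 증류 기법(knowledge distillation techniques)을 사용하여 미세 조정되었습니다."
```
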
## Intended Use

This model is designed for tasks that involve:
- **Technical Term Translation**: Especially helpful for domains with frequent use of specialized terminology (e.g., AI).
- **Parenthetical Terminology Translation (PTT)**: Retaining the original English term in parentheses to support reader comprehension and reduce ambiguity.

It is suitable for machine translation applications where technical term accuracy is critical, including **academic research translation**, **technical documentation**, and **domain-specific machine translation**.

## Model Details

- **Base Model**: [facebook/m2m100_418M](https://huggingface.co/facebook/m2m100_418M)
- **Training Data**: Parenthetical Terminology Translation (PTT) dataset
- **Languages**: English to Korean
- **Domain**: Artificial Intelligence (AI)

## Example Usage

Here’s how to use this fine-tuned model with the Hugging Face `transformers` library:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "PrompTart/m2m100_418M_PTT_en_ko"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Example sentence
text = "The model was fine-tuned using knowledge distillation techniques."

# Tokenize and generate translation
tokenizer.src_lang = "en"
encoded = tokenizer(text, return_tensors="pt")
generated_tokens = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("ko"))
translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(translation)
# => "이 모델은 지식 증류 기법(knowledge distillation techniques)을 사용하여 미세 조정되었습니다."
```

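If you need to translate several sentences at once, a minimal batch-processing sketch might look like the following. It reuses the `model` and `tokenizer` objects from above and relies only on standard `transformers` padding and generation behavior; the second example sentence is illustrative only.

```python
# Hypothetical batch translation, reusing the model and tokenizer loaded above.
sentences = [
    "The model was fine-tuned using knowledge distillation techniques.",
    "Transformer models rely on self-attention to capture long-range context.",  # illustrative example
]

tokenizer.src_lang = "en"
encoded = tokenizer(sentences, return_tensors="pt", padding=True)
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("ko"))
translations = tokenizer.batch_decode(generated, skip_special_tokens=True)

for source, translation in zip(sentences, translations):
    print(f"{source}\n-> {translation}\n")
```
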
## Limitations

- **Out-of-Domain Accuracy**: While the model generalizes to some extent, accuracy may vary in domains that were not part of the training set (such as specialized biology or physics terms not included in PTT).

## Citation

If you use this model in your research, please cite the original dataset and paper:

```tex
@misc{myung2024efficienttechnicaltermtranslation,
  title={Efficient Technical Term Translation: A Knowledge Distillation Approach for Parenthetical Terminology Translation},
  author={Jiyoon Myung and Jihyeon Park and Jungki Son and Kyungro Lee and Joohyung Han},
  year={2024},
  eprint={2410.00683},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.00683},
}
```

## Contact

For questions or feedback, please contact [jiyoon0424@gmail.com](mailto:jiyoon0424@gmail.com).