---
datasets:
- PrompTart/PTT_advanced_en_ko
language:
- en
- ko
base_model:
- facebook/m2m100_418M
library_name: transformers
---

# M2M100 Fine-Tuned on Parenthetical Terminology Translation (PTT) Dataset

## Model Overview

This is an **M2M100** model fine-tuned on the [**Parenthetical Terminology Translation (PTT)**](https://arxiv.org/abs/2410.00683) dataset. [The PTT dataset](https://huggingface.co/datasets/PrompTart/PTT_advanced_en_ko) focuses on translating technical terms accurately by placing the original English term in parentheses alongside its Korean translation, enhancing clarity and precision in specialized fields. This fine-tuned model is optimized for handling technical terminology in the **Artificial Intelligence (AI)** domain.
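
If you want to see the parenthetical annotation format first-hand, the PTT dataset can be loaded with the `datasets` library. Below is a minimal sketch; the split and column names it prints depend on the dataset's actual schema, so inspect the output rather than assuming field names.

```python
from datasets import load_dataset

# Load the Parenthetical Terminology Translation (PTT) dataset from the Hub
ptt = load_dataset("PrompTart/PTT_advanced_en_ko")

# Inspect the available splits and columns, then peek at the first example
print(ptt)
first_split = next(iter(ptt))
print(ptt[first_split][0])
```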


## Example Usage

Here's how to use this fine-tuned model with the Hugging Face `transformers` library:

<span style="color:red">**Note:**</span> `M2M100Tokenizer` depends on <span style="color:blue">`sentencepiece`</span>, so make sure to install it before running the example: `pip install sentencepiece`.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "PrompTart/m2m100_418M_PTT_en_ko"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Example input: two sentences (note the space after the period so the split below works)
text = "The model was fine-tuned using knowledge distillation techniques. \
The training dataset was created using a collaborative multi-agent framework powered by large language models."

# Tokenize and generate translation
tokenizer.src_lang = "en"
encoded = tokenizer(text.split('. '), return_tensors="pt", padding=True)
generated_tokens = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("ko"))
outputs = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print('\n'.join(outputs))
# => "이 모델은 지식 증류 기법(knowledge distillation techniques)을 사용하여 미세 조정되었습니다.
# 훈련 데이터셋(training dataset)은 대형 언어 모델(large language models)을 기반으로 한 협업 다중 에이전트 프레임워크(collaborative multi-agent framework)를 사용하여 생성되었습니다."

```
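
For GPU inference or larger batches, the same API can be used with explicit device placement and beam search. This is a minimal sketch under assumed settings: the example sentences, `num_beams`, and `max_new_tokens` values are illustrative choices rather than settings prescribed by this card.

```python
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "PrompTart/m2m100_418M_PTT_en_ko"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Move the model to GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

sentences = [
    "Transformers rely on self-attention to capture long-range dependencies.",
    "The encoder outputs are consumed by cross-attention layers in the decoder.",
]

tokenizer.src_lang = "en"
encoded = tokenizer(sentences, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    generated_tokens = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.get_lang_id("ko"),
        num_beams=5,          # beam search; illustrative value
        max_new_tokens=256,   # generation length cap; illustrative value
    )

print("\n".join(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)))
```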

## Limitations

- **Out-of-Domain Accuracy**: While the model generalizes to some extent, accuracy may vary in domains that were not part of the training set.
- **Incomplete Parenthetical Annotation**: Not all technical terms are consistently displayed in parentheses; in some cases, terms may be omitted or not annotated as expected (a simple way to flag such outputs is sketched below).
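
If your application relies on the parenthetical annotations being present, one option is to flag translations that lack them for manual review. A minimal sketch, assuming an annotation always appears as a parenthesized run of English text after the Korean term (the helper name and regex are illustrative, not part of this model):

```python
import re

# Matches a parenthesized English span such as "(knowledge distillation techniques)"
PAREN_TERM = re.compile(r"\([A-Za-z][A-Za-z0-9 .,\-/]*\)")

def has_parenthetical_terms(translation: str) -> bool:
    """Return True if the translation contains at least one parenthetical English term."""
    return bool(PAREN_TERM.search(translation))

outputs = [
    "이 모델은 지식 증류 기법(knowledge distillation techniques)을 사용하여 미세 조정되었습니다.",
    "훈련 데이터셋은 협업 다중 에이전트 프레임워크를 사용하여 생성되었습니다.",
]
flagged = [s for s in outputs if not has_parenthetical_terms(s)]
print(f"{len(flagged)} of {len(outputs)} outputs contain no parenthetical term")
```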

## Citation

If you use this model in your research, please cite the original dataset and paper:

```tex
@misc{myung2024efficienttechnicaltermtranslation,
      title={Efficient Technical Term Translation: A Knowledge Distillation Approach for Parenthetical Terminology Translation}, 
      author={Jiyoon Myung and Jihyeon Park and Jungki Son and Kyungro Lee and Joohyung Han},
      year={2024},
      eprint={2410.00683},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.00683}, 
}
```

## Contact

For questions or feedback, please contact [jiyoon0424@gmail.com](mailto:jiyoon0424@gmail.com).