---
language:
- ko
metrics:
- bleu
pipeline_tag: text2text-generation
---
# 🌊 Jeju-Standard Korean Bidirectional Translation Model
## **1. Introduction**
### 🧑‍🤝‍🧑 **Members**
- **Bitamin 12th cohort : 구준회, 이서현, 이예린**
- **Bitamin 13th cohort : 김윤영, 김재겸, 이형석**

### **GitHub Link**
- https://github.com/junhoeKu/Jeju_Translation.github.io

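### **How to Use This Model**
- Load the model with `transformers` to run inference.
- Start every input with a direction token, `[제주]` or `[표준]`, followed by the sentence to translate.
- The first example below loads the model and translates a Standard Korean sentence into Jeju; the second translates in the opposite direction.
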
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Set up the device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Junhoee/Kobart-Jeju-translation")
model = AutoModelForSeq2SeqLM.from_pretrained("Junhoee/Kobart-Jeju-translation").to(device)

# Set up the input text
# Prefix the sentence with the [제주] or [표준] token that matches the translation direction
input_text = "[표준] 안녕하세요"

# Tokenize the input text
input_ids = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).input_ids.to(device)

# Generate the translation
outputs = model.generate(input_ids, max_length=64)

# Decode and print the output
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Model Output:", decoded_output)
```
```text
Model Output: 안녕하수꽈
```

---

```python
# Reuses `tokenizer`, `model`, and `device` from the previous example

# Set up the input text
# Prefix the sentence with the [제주] or [표준] token that matches the translation direction
input_text = "[제주] 안녕하수꽈"

# Tokenize the input text
input_ids = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).input_ids.to(device)

# Generate the translation
outputs = model.generate(input_ids, max_length=64)

# Decode and print the output
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Model Output:", decoded_output)
```
```text
Model Output: 안녕하세요
```
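
The two examples above differ only in the direction token, so the tokenize, generate, and decode steps can be wrapped in one helper. The sketch below is an illustration, not part of the released code: `DIRECTION_TOKENS`, `add_direction_token`, and `translate` are hypothetical names, and the mapping of `[표준]` to Standard Korean input and `[제주]` to Jeju input is inferred from those two examples.

```python
# Hypothetical mapping, inferred from the examples above: the token names
# the variety of the INPUT sentence.
DIRECTION_TOKENS = {"standard": "[표준]", "jeju": "[제주]"}


def add_direction_token(sentence, source):
    """Prefix a sentence with the direction token for its source variety."""
    if source not in DIRECTION_TOKENS:
        raise ValueError(f"source must be one of {sorted(DIRECTION_TOKENS)}")
    return f"{DIRECTION_TOKENS[source]} {sentence.strip()}"


def translate(sentences, source, tokenizer, model, device, max_length=64):
    """Translate a list of sentences that all share the same source variety.

    `tokenizer`, `model`, and `device` are the objects created in the
    loading example above.
    """
    prompts = [add_direction_token(s, source) for s in sentences]
    batch = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
    # Pass the two tensors explicitly rather than unpacking the whole batch.
    outputs = model.generate(
        batch["input_ids"].to(device),
        attention_mask=batch["attention_mask"].to(device),
        max_length=max_length,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With `tokenizer`, `model`, and `device` loaded as shown earlier, `translate(["안녕하세요"], "standard", tokenizer, model, device)` should correspond to the first example, and passing `"jeju"` as the source to the second.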

### **Parent Model**
- gogamza/kobart-base-v2
- https://huggingface.co/gogamza/kobart-base-v2

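## **2. Dataset - about 930,000 rows**
- AI-Hub (Jeju speech data + middle-aged speakers' dialect speech data)
- GitHub (Kakao Brain JIT dataset)
- Other sources
  - Jeju dictionary data (crawled from the Jeju Provincial Office website)
  - Song lyric translation data (collected by hand from the 뭐랭하맨 YouTube channel)
  - Book data (collected by hand from the books 제주방언 그 맛과 멋 and 부에나도 지꺼져도)
  - 2018 Jeju Oral Materials Collection (2018년도 제주어 구술 자료집; collected by hand and used as evaluation data)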

## **3. Hyperparameters**
- Epochs : 3
- Learning Rate : 2e-5
- Weight Decay : 0.01
- Batch Size : 32
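
For anyone reproducing the fine-tuning, the values above can be collected into a plain dictionary. This is a sketch under assumptions, not the authors' training script: the keys follow the parameter names of `Seq2SeqTrainingArguments` in `transformers`, the batch size is assumed to be per device, and `estimate_optimizer_steps` is a hypothetical helper. The row count is the approximate dataset size reported in this card; how many rows were actually used for training is not stated, so the step count is only a rough figure.

```python
import math

# Values from the hyperparameter list; key names are an assumption about the
# training setup (they match Seq2SeqTrainingArguments parameter names).
training_kwargs = {
    "num_train_epochs": 3,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "per_device_train_batch_size": 32,
}


def estimate_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps on a single device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


# Assumption: all ~930,000 rows are used as training examples.
approx_rows = 930_000
steps = estimate_optimizer_steps(
    approx_rows,
    training_kwargs["per_device_train_batch_size"],
    training_kwargs["num_train_epochs"],
)
print(steps)  # 87189
```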

## **4. BLEU Score**
- Evaluated on the 2018 Jeju Oral Materials Collection (2018 제주어 구술 자료집)
  - Jeju -> Standard : 0.76
  - Standard -> Jeju : 0.5

- Evaluated on the validation data of the AI-Hub Jeju speech data
  - Jeju -> Standard : 0.89
  - Standard -> Jeju : 0.77
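
This card does not state which BLEU implementation or tokenization produced the scores. As a reference point for reading them, here is a minimal sketch of corpus-level BLEU on the same 0 to 1 scale, assuming whitespace tokenization, one reference per hypothesis, n-grams up to order 4, and no smoothing. `ngram_counts` and `corpus_bleu` are hypothetical helper names, and results from this sketch are not guaranteed to match the reported numbers.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU (0-1) with one reference per hypothesis, no smoothing."""
    matched = [0] * max_n  # clipped n-gram matches per order
    total = [0] * max_n    # hypothesis n-gram counts per order
    ref_len = hyp_len = 0

    for ref, hyp in zip(references, hypotheses):
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        ref_len += len(ref_tokens)
        hyp_len += len(hyp_tokens)
        for n in range(1, max_n + 1):
            ref_counts = ngram_counts(ref_tokens, n)
            hyp_counts = ngram_counts(hyp_tokens, n)
            # Clip each hypothesis n-gram by its count in the reference.
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())

    # Without smoothing, a zero precision at any order gives BLEU = 0.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Placeholder tokens stand in for real sentences.
print(corpus_bleu(["a b c d e"], ["a b c d e"]))            # 1.0
print(round(corpus_bleu(["a b c d e"], ["a b c d f"]), 3))  # 0.669
```

Because no smoothing is applied, any corpus without a single matching 4-gram scores 0, which makes this variant unsuitable for very short test sets.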

## **5. Credits**
- 구준회 : kujoon13413@gmail.com
- 김윤영 : 202000872@hufs.ac.kr
- 김재겸 : worua5667@inha.edu
- 이서현 : rlaorrn0123@sookmyung.ac.kr
- 이예린 : i75631928@gmail.com
- 이형석 : gudtjr3638@gmail.com