---
license: cc-by-nc-4.0
language:
- bo
base_model: google-t5/t5-small
tags:
- nlp
- transliteration
- tibetan
- buddhism
datasets:
- billingsmoore/tibetan-phonetic-transliteration-dataset
---
# Model Card for tibetan-phonetic-transliteration

This model is a text2text generation model for phonetic transliteration of Tibetan script.

## Model Details

### Model Description

This model performs phonetic transliteration of Unicode Tibetan script. It was fine-tuned from [google-t5/t5-small](https://huggingface.co/google-t5/t5-small) on 98,597 pairs of Tibetan text and phonetic transliterations drawn from the Tibetan Buddhist canon.

- **Developed by:** billingsmoore
- **Model type:** text2text generation
- **Language(s) (NLP):** Tibetan
- **License:** [Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)
- **Finetuned from model:** [google-t5/t5-small](https://huggingface.co/google-t5/t5-small)

### Model Sources

- **Repository:** [https://github.com/billingsmoore/MLotsawa](https://github.com/billingsmoore/MLotsawa)

## Uses

The intended use of this model is to provide phonetic transliteration of Tibetan script, typically as part of a larger Tibetan translation ecosystem.

### Direct Use

To use the model for transliteration in a Python script, load it with the `transformers` library like so (the Tibetan input below is an illustrative example):

```python
from transformers import pipeline

transliterator = pipeline('translation', model='billingsmoore/tibetan-phonetic-transliteration')

# Pass a string of Unicode Tibetan script; this input is just an illustrative example
tibetan_text = 'ཨོཾ་མ་ཎི་པདྨེ་ཧཱུྃ།'
transliterated_text = transliterator(tibetan_text)
```
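
The `translation` pipeline returns a list of dictionaries, with the transliteration itself stored under the `translation_text` key:

```python
# Extract the transliterated string from the pipeline output
print(transliterated_text[0]['translation_text'])
```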

### Downstream Use 

The model can be fine-tuned for a specific use case using the following code, where `<your dataset>` stands in for your own dataset of transliteration pairs.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    Adafactor
)
from accelerate import Accelerator

# Load your dataset and hold out 10% of it for evaluation
dataset = load_dataset(<your dataset>)
dataset = dataset['train'].train_test_split(test_size=0.1)

checkpoint = "billingsmoore/tibetan-phonetic-transliteration"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

source_lang = 'bo'    # Unicode Tibetan script
target_lang = 'phon'  # phonetic transliteration

def preprocess_function(examples):
    # Tokenize the Tibetan inputs and the transliteration targets together
    inputs = examples[source_lang]
    targets = examples[target_lang]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=256, truncation=True, padding="max_length")
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Adafactor with a fixed learning rate, as is common when fine-tuning T5 models
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=False,
    warmup_init=False,
    lr=3e-4
)

# Let Accelerate place the model and optimizer on the available hardware
accelerator = Accelerator()
model, optimizer = accelerator.prepare(model, optimizer)

training_args = Seq2SeqTrainingArguments(
    output_dir=".",
    auto_find_batch_size=True,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    num_train_epochs=5
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    optimizers=(optimizer, None),  # use Adafactor with no learning-rate scheduler
    data_collator=data_collator
)

trainer.train()
```
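
After training completes, you can save the fine-tuned weights and tokenizer for later use. A minimal sketch; the output path here is just an example:

```python
# Save the fine-tuned model and tokenizer to a local directory
trainer.save_model('./tibetan-transliteration-finetuned')
tokenizer.save_pretrained('./tibetan-transliteration-finetuned')
```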

## Bias, Risks, and Limitations

This model was trained exclusively on material from the Tibetan Buddhist canon, and thus on Literary Tibetan.
It may not perform satisfactorily on texts from other corpora or on other dialects of Tibetan.

### Recommendations

For users who wish to apply the model to other kinds of text, I recommend further fine-tuning on your own dataset using the instructions above.

## Training Details

This model was trained on 98,597 pairs of text, in which the first member is a line of Unicode Tibetan text and the second (the target) is the phonetic transliteration of the first.
This dataset was scraped from Lotsawa House and is released on Kaggle under the same license as the texts from which it is sourced.
The dataset and more information are available [on Kaggle](https://www.kaggle.com/datasets/billingsmoore/tibetan-phonetic-transliteration-pairs) and [on Hugging Face](https://huggingface.co/datasets/billingsmoore/tibetan-phonetic-transliteration-dataset).
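
As a quick check of what the training pairs look like, the dataset can be loaded directly from the Hugging Face Hub. A minimal sketch, assuming the column names `bo` and `phon` used in the fine-tuning example above:

```python
from datasets import load_dataset

# Load the transliteration dataset from the Hugging Face Hub
ds = load_dataset('billingsmoore/tibetan-phonetic-transliteration-dataset')

# Each example pairs Tibetan script ('bo') with its phonetic transliteration ('phon')
print(ds['train'][0])
```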

This model was trained for five epochs. Further information regarding training can be found in the documentation of the [MLotsawa repository](https://github.com/billingsmoore/MLotsawa).

## Model Card Contact

billingsmoore [at] gmail [dot] com