Update README.md
README.md
CHANGED
@@ -1,3 +1,76 @@
---
- license:
---
---
license: apache-2.0
language:
- bn
metrics:
- wer
- cer
tags:
- seq2seq
- ipa
- bengali
- byt5
widget:
- text: <Narail> আমি সে বাবুর মামু বাড়ি গিছিলাম।
  example_title: Narail Text
- text: <Rangpur> এখন এই কুলো তার শেষ অই কুলো তার শেষ।
  example_title: Rangpur Text
- text: <Chittagong> খয়দে সিআরের এইল্লা কি অবস্থা!
  example_title: Chittagong Text
- text: <Kishoreganj> আটাইশ করছিলাম দের কানি ক্ষেত, ইবার মাইর কাইছি।
  example_title: Kishoreganj Text
- text: <Narsingdi> তারা তো ওই খারাপ খেইলাই আসে না।
  example_title: Narsingdi Text
- text: <Tangail> আর সব থেকে ফানি কথা হইতেছে দেখ?
  example_title: Tangail Text
---

# Regional Bengali text to IPA transcription - umt5-base

This is a fine-tuned version of [google/umt5-base](https://huggingface.co/google/umt5-base) for generating IPA transcriptions from regional Bengali text.
It was fine-tuned on the dataset of the competition [“ভাষামূল: মুখের ভাষার খোঁজে”](https://www.kaggle.com/competitions/regipa/overview) by Bengali.AI.

Scores achieved so far (test scores):
- **Word error rate (wer)**: 0.02390405721962450
- **Character error rate (cer)**: 0.01011514943093060

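
For readers unfamiliar with the two metrics, here is a minimal sketch of how word and character error rates are commonly defined: the edit distance between reference and hypothesis, divided by the reference length. The function names are illustrative, the sample strings are made up, and the competition's official scorer may normalize text or aggregate across sentences differently. Note that Python strings are compared per code point, not per grapheme cluster.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up strings, only to show the call pattern.
print(wer("ami bhat khai", "ami bat khai"))
print(cer("ami bhat khai", "ami bat khai"))
```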

Supported district tokens:
- Kishoreganj
- Narail
- Narsingdi
- Chittagong
- Rangpur
- Tangail
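
The widget examples in the front matter prefix each sentence with a district token in angle brackets, e.g. `<Narail> ...`. A small helper along these lines could build inputs in that shape and reject unsupported districts early. `SUPPORTED_DISTRICTS` and `format_input` are hypothetical names introduced for illustration, not part of the model or of `transformers`.

```python
# District names copied from the list above; the tuple name is illustrative.
SUPPORTED_DISTRICTS = (
    "Kishoreganj",
    "Narail",
    "Narsingdi",
    "Chittagong",
    "Rangpur",
    "Tangail",
)


def format_input(district, text):
    """Return a model input string of the form '<District> text'."""
    if district not in SUPPORTED_DISTRICTS:
        raise ValueError(
            f"unsupported district {district!r}; expected one of {SUPPORTED_DISTRICTS}"
        )
    text = text.strip()
    if not text:
        raise ValueError("text must not be empty")
    return f"<{district}> {text}"


# Example using one of the widget sentences.
print(format_input("Tangail", "আর সব থেকে ফানি কথা হইতেছে দেখ?"))
```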

---

## Loading & using the model
```python
# Load the tokenizer and model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("teamapocalypseml/ben2ipa-umt5base")
model = AutoModelForSeq2SeqLM.from_pretrained("teamapocalypseml/ben2ipa-umt5base")

# The input text MUST be formatted as: <district> <bengali_text>
text = "<district> bengali_text_here"
text_ids = tokenizer(text, return_tensors="pt").input_ids

# Seq2seq models produce output tokens through generate();
# decode the generated ids back into the IPA string.
output_ids = model.generate(text_ids, max_length=512)
ipa = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(ipa)
```

## Using the pipeline
```python
# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("text2text-generation", model="teamapocalypseml/ben2ipa-umt5base", device=device)

# Each entry of `texts` must be formatted as: <district> <contents>
texts = ["<district> bengali_text_here"]
batch_size = 8  # placeholder value; adjust to your hardware
outputs = pipe(texts, max_length=512, batch_size=batch_size)
```
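
To read the results, the sketch below assumes the pipeline returns one dictionary per input text with a `generated_text` key, which is the usual shape for `text2text-generation` when a list of strings is passed. `pair_transcriptions` is a hypothetical helper, and the outputs shown are placeholders, not real model predictions.

```python
def pair_transcriptions(texts, outputs):
    """Pair each input text with its generated transcription."""
    if len(texts) != len(outputs):
        raise ValueError("expected exactly one output per input text")
    pairs = []
    for text, out in zip(texts, outputs):
        # Some call patterns return a list of candidates per input;
        # in that case keep only the first candidate.
        if isinstance(out, list):
            out = out[0]
        pairs.append((text, out["generated_text"]))
    return pairs


# Placeholder data standing in for real pipeline inputs and outputs.
texts = ["<Narail> first_sentence", "<Rangpur> second_sentence"]
placeholder_outputs = [
    {"generated_text": "placeholder_ipa_1"},
    [{"generated_text": "placeholder_ipa_2"}],
]

for source, ipa in pair_transcriptions(texts, placeholder_outputs):
    print(source, "->", ipa)
```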

## Credits
Developed by [S M Jishanul Islam](https://huggingface.co/smji), [Sadia Ahmmed](https://huggingface.co/sadiaahmmed), and [Sahid Hossain Mustakim](https://huggingface.co/rhsm15).