Fill-Mask
Transformers
PyTorch
Safetensors
deberta
Generated from Trainer
Inference Endpoints
Sakonii committed on
Commit
b96f7f8
1 Parent(s): 2bbc891

Update README.md

Files changed (1)
  1. README.md +80 -16
README.md CHANGED
@@ -1,34 +1,98 @@
  ---
  license: mit
  tags:
  - generated_from_trainer
  model-index:
  - name: de-berta-base-base-nepali
    results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
  # de-berta-base-base-nepali

- This model is a fine-tuned version of [microsoft/deberta-base](https://huggingface.co/microsoft/deberta-base) on the None dataset.
  It achieves the following results on the evaluation set:
- - Loss: 1.8600

  ## Model description

- More information needed

  ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

  ## Training procedure

  ### Training hyperparameters
@@ -44,13 +108,13 @@ The following hyperparameters were used during training:
  ### Training results

- | Training Loss | Epoch | Step | Validation Loss |
- |:-------------:|:-----:|:------:|:---------------:|
- | 2.5454 | 1.0 | 188789 | 2.4273 |
- | 2.2592 | 2.0 | 377578 | 2.1448 |
- | 2.1171 | 3.0 | 566367 | 2.0030 |
- | 2.0227 | 4.0 | 755156 | 1.9133 |
- | 1.9375 | 5.0 | 943945 | 1.8600 |

  ### Framework versions
  ---
  license: mit
+ mask_token: "<mask>"
  tags:
  - generated_from_trainer
  model-index:
  - name: de-berta-base-base-nepali
    results: []
+ widget:
+ - text: "मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।"
+   example_title: "Example 1"
+ - text: "अचेल विद्यालय र कलेजहरूले स्मारिका कत्तिको प्रकाशन गर्छन्, यकिन छैन । केही वर्षपहिलेसम्म गाउँसहरका सानाठूला <mask> संस्थाहरूमा पुग्दा शिक्षक वा कर्मचारीले संस्थाबाट प्रकाशित पत्रिका, स्मारिका र पुस्तक कोसेलीका रूपमा थमाउँथे ।"
+   example_title: "Example 2"
+ - text: "जलविद्युत् विकासको ११० वर्षको इतिहास बनाएको नेपालमा हाल सरकारी र निजी क्षेत्रबाट गरी करिब २ हजार मेगावाट <mask> उत्पादन भइरहेको छ ।"
+   example_title: "Example 3"
  ---

  # de-berta-base-base-nepali

+ This model is pre-trained on the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset, which consists of over 13 million Nepali text sequences, using a masked language modeling (MLM) objective. Our approach trains a Sentence Piece Model (SPM) for text tokenization, similar to [XLM-RoBERTa](https://arxiv.org/abs/1911.02116), and trains [DeBERTa](https://arxiv.org/abs/2006.03654) for language modeling.
+
  It achieves the following results on the evaluation set:
+
+ | MLM probability | Evaluation loss | Evaluation perplexity |
+ |----------------:|----------------:|----------------------:|
+ | 20%             | 1.860           | 6.424                 |
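+
+ The reported perplexity is simply the exponential of the evaluation loss, which is easy to verify (illustrative snippet, not from the original card):
+
+ ```python
+ import math
+ print(round(math.exp(1.860), 3))  # 6.424, matching the evaluation perplexity above
+ ```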
 
  ## Model description

+ Refer to the original [microsoft/deberta-base](https://huggingface.co/microsoft/deberta-base).

  ## Intended uses & limitations

+ This backbone model is intended to be fine-tuned on Nepali-language-focused downstream tasks such as sequence classification, token classification or question answering; a minimal fine-tuning sketch is shown below.
+ As the language model is trained on texts grouped into blocks of 512 tokens, it handles text sequences of up to 512 tokens and may not perform satisfactorily on shorter sequences.
+
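+ As an illustration, here is a minimal, hypothetical fine-tuning sketch for sequence classification; the toy dataset, label count and hyperparameters are placeholders, not taken from the original card:
+
+ ```python
+ from datasets import Dataset
+ from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
+                           Trainer, TrainingArguments)
+
+ tokenizer = AutoTokenizer.from_pretrained('Sakonii/deberta-base-nepali')
+ # num_labels is task-specific; 2 is a placeholder for a binary task
+ model = AutoModelForSequenceClassification.from_pretrained(
+     'Sakonii/deberta-base-nepali', num_labels=2)
+
+ # toy labeled dataset for illustration only; replace with a real Nepali corpus
+ data = Dataset.from_dict({'text': ['राम्रो छ ।', 'नराम्रो छ ।'], 'label': [1, 0]})
+ data = data.map(lambda batch: tokenizer(batch['text'], truncation=True,
+                                         max_length=512), batched=True)
+
+ trainer = Trainer(model=model,
+                   args=TrainingArguments(output_dir='out', num_train_epochs=1),
+                   train_dataset=data, tokenizer=tokenizer)
+ trainer.train()
+ ```
+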
+ ## Usage
+
+ This model can be used directly with a pipeline for masked language modeling:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> unmasker = pipeline('fill-mask', model='Sakonii/deberta-base-nepali')
+ >>> unmasker("मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।")
+
+ [{'score': 0.10054448992013931,
+   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, वातावरण, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
+   'token': 790,
+   'token_str': 'वातावरण'},
+  {'score': 0.05399947986006737,
+   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, स्वास्थ्य, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
+   'token': 231,
+   'token_str': 'स्वास्थ्य'},
+  {'score': 0.045006219297647476,
+   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, जल, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
+   'token': 1313,
+   'token_str': 'जल'},
+  {'score': 0.04032573476433754,
+   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, पर्यावरण, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
+   'token': 13156,
+   'token_str': 'पर्यावरण'},
+  {'score': 0.026729246601462364,
+   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, संचार, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
+   'token': 3996,
+   'token_str': 'संचार'}]
+ ```
+
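+ The number of returned candidates can be controlled with the pipeline's standard top_k argument, e.g. unmasker(text, top_k=10).
+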
+ Here is how we can use the model to get the features of a given text in PyTorch:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ tokenizer = AutoTokenizer.from_pretrained('Sakonii/deberta-base-nepali')
+ model = AutoModelForMaskedLM.from_pretrained('Sakonii/deberta-base-nepali')
+
+ # prepare input
+ text = "चाहिएको text यता राख्नु होला।"
+ encoded_input = tokenizer(text, return_tensors='pt')
+
+ # forward pass
+ output = model(**encoded_input)
+ ```
+
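+ Note that AutoModelForMaskedLM returns vocabulary logits; for contextual token features, the hidden states can be requested instead (a continuation of the example above, assuming the standard transformers forward arguments):
+
+ ```python
+ # re-run the forward pass, asking for all hidden states
+ output = model(**encoded_input, output_hidden_states=True)
+ features = output.hidden_states[-1]  # last layer, shape (1, sequence_length, 768)
+ ```
+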
+ ## Training data
+
+ This model is trained on the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) language modeling dataset, which combines the [OSCAR](https://huggingface.co/datasets/oscar) and [cc100](https://huggingface.co/datasets/cc100) datasets with a set of Nepali articles scraped from Wikipedia.
+ For training the language model, the texts in the training set are grouped into blocks of 512 tokens, as sketched below.
+
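+ A minimal sketch of that grouping step, in the style of the Hugging Face run_mlm examples (the card does not publish its exact preprocessing code, so treat this as an assumption):
+
+ ```python
+ block_size = 512
+
+ def group_texts(examples):
+     # concatenate already-tokenized sequences, then split into fixed 512-token blocks
+     concatenated = {k: sum(examples[k], []) for k in examples.keys()}
+     total_length = (len(concatenated['input_ids']) // block_size) * block_size
+     return {
+         k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
+         for k, t in concatenated.items()
+     }
+
+ # typically applied with datasets.Dataset.map(group_texts, batched=True)
+ ```
+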
+ ## Tokenization
+
+ A Sentence Piece Model (SPM) is trained on a subset of the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset for text tokenization. The tokenizer is trained with vocab-size=24576, min-frequency=4, limit-alphabet=1000 and model-max-length=512; a sketch of this step is shown below.
+
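+ A hypothetical reconstruction of this step with the Hugging Face tokenizers library; only the hyperparameters above come from the card, while the trainer class and file path are assumptions:
+
+ ```python
+ from tokenizers import SentencePieceBPETokenizer
+
+ tokenizer = SentencePieceBPETokenizer()
+ tokenizer.train(
+     files=['nepalitext_subset.txt'],  # placeholder path to a raw-text subset
+     vocab_size=24576,
+     min_frequency=4,
+     limit_alphabet=1000,
+ )
+ tokenizer.save('tokenizer.json')  # model_max_length=512 is applied when wrapping for transformers
+ ```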
 
  ## Training procedure
+ The model is trained with the same configuration as the original [microsoft/deberta-base](https://huggingface.co/microsoft/deberta-base): 512 tokens per instance, 6 instances per batch, and around 188.8K training steps per epoch. A sketch of such a setup is shown below.
+
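+ For illustration, a minimal MLM pre-training setup consistent with these numbers; this is an assumption-laden sketch (the card publishes no training script), reusing the 20% masking probability from the evaluation table and a hypothetical train_dataset of 512-token blocks from the grouping step above:
+
+ ```python
+ from transformers import (AutoConfig, AutoModelForMaskedLM, AutoTokenizer,
+                           DataCollatorForLanguageModeling, Trainer, TrainingArguments)
+
+ tokenizer = AutoTokenizer.from_pretrained('Sakonii/deberta-base-nepali')
+ # same architecture as microsoft/deberta-base, resized to the 24576-token SPM vocabulary
+ config = AutoConfig.from_pretrained('microsoft/deberta-base', vocab_size=tokenizer.vocab_size)
+ model = AutoModelForMaskedLM.from_config(config)
+
+ collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.2)
+ args = TrainingArguments(output_dir='out', per_device_train_batch_size=6,
+                          num_train_epochs=5)
+ trainer = Trainer(model=model, args=args, data_collator=collator,
+                   train_dataset=train_dataset)  # train_dataset: hypothetical, see above
+ trainer.train()
+ ```
+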
 
  ### Training hyperparameters

@@ -44,13 +108,13 @@ The following hyperparameters were used during training:

  ### Training results

+ | Training Loss | Epoch | Step | Validation Loss | Perplexity |
+ |:-------------:|:-----:|:------:|:---------------:|:----------:|
+ | 2.5454 | 1.0 | 188789 | 2.4273 | 11.3283 |
+ | 2.2592 | 2.0 | 377578 | 2.1448 | 8.5403 |
+ | 2.1171 | 3.0 | 566367 | 2.0030 | 7.4113 |
+ | 2.0227 | 4.0 | 755156 | 1.9133 | 6.7754 |
+ | 1.9375 | 5.0 | 943945 | 1.8600 | 6.4237 |

  ### Framework versions