### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The training data comes from a collection of [Bulgarian grammar mistakes](https://huggingface.co/datasets/thebogko/bulgarian-grammar-mistakes), which contains 7.59k rows spanning four types of grammar errors:

1) **Misuse of articles**
2) **Misuse of pronouns**
3) Incorrect appending of 'me' for plural verbs in the first person
4) Disagreement between nouns and adjectives in grammatical gender and number

Only the first two types were used to fine-tune this model: they are much more common overall, especially among native Bulgarian speakers, and restricting the data lets the model focus on them.

After filtering to these two types, 3090 pairs remain, which were split into training/validation/test sets (72/18/10). This split yields 2224 training pairs.
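The filtering and split described above can be sketched as follows. This is a minimal pure-Python illustration: the field names (`error_type`, `incorrect`, `correct`) and the toy rows are assumptions for demonstration, not the dataset's actual schema or contents.

```python
import random

# Toy stand-in for the dataset rows (the real collection has 7.59k rows).
# Field names and error-type labels here are illustrative assumptions.
rows = [
    {"error_type": t, "incorrect": f"wrong-{i}", "correct": f"right-{i}"}
    for i, t in enumerate(["article", "pronoun", "verb_me", "agreement"] * 25)
]

# Keep only the two most common error types, as described above.
kept = [r for r in rows if r["error_type"] in {"article", "pronoun"}]

# Shuffle, then split 72/18/10 into train/validation/test.
random.seed(42)
random.shuffle(kept)
n = len(kept)
n_train = int(n * 0.72)
n_val = int(n * 0.18)
train = kept[:n_train]
val = kept[n_train:n_train + n_val]
test = kept[n_train + n_val:]
print(len(train), len(val), len(test))
```

Applied to the real 3090 filtered pairs, the same 72/18/10 arithmetic gives the 2224 training pairs reported above.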

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

The standard fine-tuning procedure was applied: batches were created from the training samples and the model was evaluated after each epoch. The model weights are optimised using cross-entropy loss.
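As a minimal pure-Python illustration of the loss being optimised (toy probabilities rather than real model outputs): cross-entropy is the negative log-likelihood of the correct token under the model's predicted distribution, so confident correct predictions are penalised far less than wrong ones.

```python
import math

def cross_entropy(predicted_probs, target_index):
    """Negative log-likelihood of the correct token under the predicted
    distribution; in training this is averaged over tokens and batches."""
    return -math.log(predicted_probs[target_index])

# Toy distribution over a 4-token vocabulary.
probs = [0.1, 0.7, 0.1, 0.1]
loss_confident = cross_entropy(probs, 1)  # correct token has high probability
loss_wrong = cross_entropy(probs, 0)      # correct token has low probability
print(round(loss_confident, 4), round(loss_wrong, 4))  # → 0.3567 2.3026
```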

#### Training Hyperparameters

Grid search was applied to find the best learning rate, number of epochs, weight decay and batch size. The setup chosen at the end of the experimentation stage was:

1) **batch_size**: 4
2) **learning_rate**: 0.0002
3) **weight_decay**: 0.001
4) **epochs**: 4

This grid search was performed 3 separate times, and the chosen setup achieved the lowest average validation loss, 0.01431.
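A grid search of this kind can be sketched as below. The candidate values and the stubbed `validation_loss` function are illustrative assumptions (a real run would fine-tune and evaluate the model for each configuration); only the winning configuration and its loss come from this card.

```python
import itertools

# Hypothetical search space; in reality each axis would hold the
# candidate values actually tried during experimentation.
grid = {
    "batch_size": [4, 8, 16],
    "learning_rate": [2e-4, 2e-5],
    "weight_decay": [0.001, 0.01],
    "epochs": [2, 4],
}

def validation_loss(config):
    """Stand-in for a full fine-tuning run returning validation loss;
    hard-coded so the sketch is self-contained and runnable."""
    best = {"batch_size": 4, "learning_rate": 2e-4,
            "weight_decay": 0.001, "epochs": 4}
    return 0.01431 if config == best else 0.05

# Exhaustively evaluate every combination and keep the best.
keys = list(grid)
best_config, best_loss = None, float("inf")
for values in itertools.product(*(grid[k] for k in keys)):
    config = dict(zip(keys, values))
    loss = validation_loss(config)
    if loss < best_loss:
        best_config, best_loss = config, loss
print(best_config, best_loss)
```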

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

Evaluation was performed against three other models:

- a bespoke RNN encoder-decoder model with attention
- [GPT-3.5 Turbo](https://platform.openai.com/docs/models/gpt-3-5-turbo) by [OpenAI](https://openai.com)
- [BgGPT](https://huggingface.co/INSAIT-Institute/BgGPT-7B-Instruct-v0.1) by [INSAIT](https://insait.ai)

### Testing Data, Factors & Metrics

#### Testing Data

The testing data consists of 309 pairs: the 10% test portion of the original train/validation/test split (72/18/10) over 3090 pairs.

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

The models were evaluated using precision, recall, F1 score, F0.5 score and BLEU.
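The F-scores relate to precision and recall through the standard F-beta formula; F0.5 weights precision more heavily than recall, which matters for correction tasks where a wrong "fix" is worse than a missed one. A small sketch with illustrative edit counts (not real evaluation output):

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta < 1 weights precision more heavily than recall,
    which is why F0.5 is reported alongside F1 for correction tasks."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative counts of correct/spurious/missed edits.
tp, fp, fn = 60, 20, 40
precision = tp / (tp + fp)  # 0.75
recall = tp / (tp + fn)     # 0.60
f1 = f_beta(precision, recall, 1.0)
f05 = f_beta(precision, recall, 0.5)
print(round(precision, 2), round(recall, 2), round(f1, 4), round(f05, 4))
```

With precision above recall, F0.5 comes out higher than F1, rewarding the precision-leaning system.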

### Results

The results are averaged over the testing pairs.

| Model | Precision | Recall | F1 | F0.5 | BLEU |
|---|---|---|---|---|---|
| **mt5-base finetuned bulgarian-grammar-mistakes** | **0.6812** | **0.6861** | **0.6828** | **0.6818** | **0.9623** |
| GPT-3.5 Turbo | 0.3751 | 0.6052 | 0.4331 | 0.3934 | 0.7666 |
| BgGPT | 0.3307 | 0.5987 | 0.3934 | 0.3503 | 0.7110 |
| RNN encoder-decoder with attention | 0.1717 | 0.2362 | 0.1820 | 0.1748 | 0.2087 |

#### Summary

The evaluation shows that the fine-tuned model outperforms the other models across all chosen metrics, particularly precision. This implies that the model's strength lies in ensuring that the corrections it makes are in fact valid, in contrast to the other models, all of which exhibit recall much higher than their respective precision.

<!--
## Citation [optional]

**BibTeX:**

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

## Model Card Contact

[More Information Needed]-->