Update README.md
## Training data

The [inputted training data](https://github.com/chap0lin/PPF-MCTI/tree/master/Datasets) was obtained with scraping techniques applied to over 30 different platforms, e.g. The Royal Society and the Annenberg Foundation, and contained 928 labeled entries (928 rows x 21 columns). Of the data gathered, only the main text content (column u) was used. Text content averages 800 tokens in length, but with high variance, up to 5,000 tokens.
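Given the layout described above (928 rows x 21 columns, with the main text in column u, the 21st spreadsheet column), isolating that column might look like the following sketch. The two-row table here is a toy stand-in for the real dataset, and the column labels are assumptions:

```python
import csv
import io
import string

# Toy stand-in for the scraped spreadsheet: 21 columns labelled a..u,
# mirroring the 928 x 21 layout described above (only 2 rows here).
header = list(string.ascii_lowercase[:21])          # 'a' .. 'u'
rows = [
    header,
    [""] * 20 + ["First opportunity text ..."],
    [""] * 20 + ["Second opportunity text ..."],
]
buf = io.StringIO()
csv.writer(buf).writerows(rows)
buf.seek(0)

# Keep only the main text content (column u), discarding the other 20 fields.
texts = [row["u"] for row in csv.DictReader(buf)]
print(len(texts))   # one entry per labeled row
```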

The following assumptions were considered:

- Preprocessing experiments compare accuracy in a shallow neural network (SNN);
- Pre-processing was investigated for the classification goal.
From the database obtained in Goal 4, stored in the project's [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/scraps-desenvolvimento/Rotulagem/db_PPF_validacao_para%20UNB_%20FINAL.xlsx), a notebook was developed in [Google Colab](https://colab.research.google.com) to implement the [preprocessing code](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), which can also be found on the project's GitHub.

Several Python packages were used to develop the preprocessing code:

First, the treatment of punctuation and capitalization was evaluated. This phase resulted in the construction and evaluation of the first four bases (xp1, xp2, xp3, xp4).
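The exact combinations behind xp1–xp4 are not spelled out here, but the two treatments combine into four bases naturally. A minimal sketch, with the mapping of names to combinations assumed:

```python
import string

def lowercase(text: str) -> str:
    # Capitalization treatment: fold everything to lower case.
    return text.lower()

def strip_punctuation(text: str) -> str:
    # Punctuation treatment: drop all punctuation characters.
    return text.translate(str.maketrans("", "", string.punctuation))

raw = "Call for Proposals: apply NOW!"
# Four hypothetical bases from the two binary choices (assignment assumed):
xp1 = raw                                   # unchanged
xp2 = lowercase(raw)                        # lower case only
xp3 = strip_punctuation(raw)                # punctuation removed only
xp4 = strip_punctuation(lowercase(raw))     # both treatments
print(xp4)   # call for proposals apply now
```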
Then, content simplification was evaluated from the xp4 base, considering stemming (xp5), lemmatization (xp6), stemming + stopword removal (xp7), and lemmatization + stopword removal (xp8).
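The stemming and stopword-removal variants can be sketched as below. The suffix rules and stopword list here are illustrative toy stand-ins for a real stemmer and stopword list (e.g. NLTK's `SnowballStemmer` and its English stopwords); xp6 and xp8 would substitute a lemmatizer for the stemmer:

```python
# Toy stopword list; a real pipeline would use a library-provided one.
STOPWORDS = {"the", "of", "for", "and", "a", "is", "are"}

def toy_stem(word: str) -> str:
    # Crude suffix stripping, standing in for a real stemming algorithm.
    for suffix in ("ing", "ities", "ies", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def simplify(tokens, reduce, drop_stopwords):
    # Optionally drop stopwords, then reduce each remaining token.
    kept = [t for t in tokens if not (drop_stopwords and t in STOPWORDS)]
    return [reduce(t) for t in kept]

tokens = "the funding opportunities for researchers".split()
xp5 = simplify(tokens, toy_stem, drop_stopwords=False)  # stemming only
xp7 = simplify(tokens, toy_stem, drop_stopwords=True)   # stemming + stopword removal
print(xp7)   # ['fund', 'opportun', 'researcher']
```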
All eight bases were evaluated to classify the eligibility of the opportunity through the training of a shallow neural network (SNN). The metrics for the eight bases were evaluated. The results are

As a last step, a spreadsheet was generated for the model (xp8) with the fields opo_pre and opo_pre_tkn, containing the preprocessed text in sentence format and tokens, respectively. This [database](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/oportunidades_final_pre_processado.xlsx) was made available on the project's GitHub with the inclusion of the columns `opo_pre` (text) and `opo_pre_tkn` (tokenized).
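The relationship between the two exported fields can be illustrated as follows; the helper function is hypothetical, and the surrounding spreadsheet-writing code (e.g. writing the rows out to .xlsx) is omitted:

```python
def make_row(preprocessed_tokens):
    # One spreadsheet row: the same preprocessed document in both formats.
    return {
        "opo_pre": " ".join(preprocessed_tokens),   # sentence format
        "opo_pre_tkn": preprocessed_tokens,         # token list
    }

row = make_row(["fund", "opportun", "researcher"])
print(row["opo_pre"])   # fund opportun researcher
```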

### Pretraining