MarcosDib committed
Commit 5abd0fc · Parent(s): 9f6fdc6

Update README.md

Files changed (1): README.md (+6 −6)

README.md CHANGED
@@ -180,7 +180,7 @@ This bias will also affect all fine-tuned versions of this model.
 
 ## Training data
 
- The [inputted training](https://github.com/chap0lin/PPF-MCTI/tree/master/Datasets) data was obtained from scrapping techniques, over 30 different platforms e.g. The Royal Society,
+ The [inputted training data](https://github.com/chap0lin/PPF-MCTI/tree/master/Datasets) was obtained with scraping techniques from over 30 different platforms, e.g., The Royal Society and the
 Annenberg Foundation, and contained 928 labeled entries (928 rows x 21 columns). Of the data gathered, only
 the main text content (column u) was used. Text content averages 800 tokens in length, but with high variance, up to 5,000 tokens.
 
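The length figures above are easy to sanity-check once the dataset is loaded. A minimal sketch, assuming pandas; the spreadsheet file name and the `u` column header are hypothetical stand-ins for the actual labeled database:

```python
# Minimal sketch: sanity-check the token-length statistics quoted above.
# The file name and the "u" column header are hypothetical stand-ins.
import pandas as pd

df = pd.read_excel("db_PPF_validacao_para UNB_ FINAL.xlsx")
print(df.shape)  # expected: (928, 21)

# Whitespace token counts per entry; mean should be near 800, max near 5,000.
lengths = df["u"].astype(str).str.split().str.len()
print(lengths.describe())
```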
@@ -197,8 +197,8 @@ The following assumptions were considered:
 - Preprocessing experiments compare accuracy in a shallow neural network (SNN);
 - Pre-processing was investigated for the classification goal.
 
- From the Database obtained in Meta 4, stored in the project's [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/scraps-desenvolvimento/Rotulagem/db_PPF_validacao_para%20UNB_%20FINAL.xlsx), a Notebook was developed in [Google Colab](https://colab.research.google.com)
- to implement the [pre-processing code](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), which also can be found on the project's GitHub.
+ From the database obtained in Goal 4, stored in the project's [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/scraps-desenvolvimento/Rotulagem/db_PPF_validacao_para%20UNB_%20FINAL.xlsx), a notebook was developed in [Google Colab](https://colab.research.google.com)
+ to implement the [preprocessing code](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), which can also be found on the project's GitHub.
 
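As a rough illustration of the notebook's first step, a minimal sketch of loading that database straight from GitHub, assuming pandas with the openpyxl engine; the `raw/` URL form is an assumption derived from the blob link above:

```python
# Minimal sketch: load the labeled database into a Colab notebook.
# Assumes pandas + openpyxl; the raw/ URL is derived from the blob link.
import pandas as pd

URL = (
    "https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/raw/"
    "scraps-desenvolvimento/Rotulagem/db_PPF_validacao_para%20UNB_%20FINAL.xlsx"
)

df = pd.read_excel(URL, engine="openpyxl")
print(df.shape)  # expected: (928, 21)
```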
 Several Python packages were used to develop the preprocessing code:
 
@@ -235,8 +235,8 @@ Table 4: Preprocessing methods evaluated
 First, the treatment of punctuation and capitalization was evaluated. This phase resulted in the construction and
 evaluation of the first four bases (xp1, xp2, xp3, xp4).
 
- Then, the content simplification was evaluated, from the xp4 base, considering stemming (xp5), stemming (xp6),
- stemming + stopwords removal (xp7), and stemming + stopwords removal (xp8).
+ Then, content simplification was evaluated from the xp4 base, considering stemming (xp5), lemmatization (xp6),
+ stemming + stopwords removal (xp7), and lemmatization + stopwords removal (xp8).
 
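A minimal sketch of the four simplification variants, assuming NLTK; the project notebook may use different packages and settings:

```python
# Minimal sketch of the content-simplification variants xp5–xp8, assuming NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def simplify(text, lemmatize=False, drop_stopwords=False):
    tokens = text.lower().split()  # simple whitespace tokenizer for brevity
    if drop_stopwords:  # xp7 and xp8
        tokens = [t for t in tokens if t not in stop_words]
    if lemmatize:  # xp6 and xp8
        return [lemmatizer.lemmatize(t) for t in tokens]
    return [stemmer.stem(t) for t in tokens]  # xp5 and xp7

# xp5: simplify(text)                                        stemming
# xp6: simplify(text, lemmatize=True)                        lemmatization
# xp7: simplify(text, drop_stopwords=True)                   stemming + stopwords removal
# xp8: simplify(text, lemmatize=True, drop_stopwords=True)   lemmatization + stopwords removal
```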
 All eight bases were evaluated on classifying the eligibility of the opportunity, by training a shallow
 neural network (SNN), and the metrics for the eight bases were compared. The results are
@@ -262,7 +262,7 @@ document-embedding). The training time is so close that it did not have such a l
 
 As a last step, a spreadsheet was generated for the model (xp8) with the fields opo_pre and opo_pre_tkn, containing the
 preprocessed text in sentence format and tokens, respectively. This [database](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/oportunidades_final_pre_processado.xlsx) was made
- available on the project's GitHub with the inclusion of columns \\opo_pre (text)\\ and \\opo_pre_tkn (tokenized)\\.
+ available on the project's GitHub with the inclusion of columns **opo_pre** (text) and **opo_pre_tkn** (tokenized).
 
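A minimal sketch of that last step, assuming pandas; the input file name, the text-column header, and the `preprocess` placeholder are hypothetical stand-ins for the notebook's actual code:

```python
# Minimal sketch: add the opo_pre / opo_pre_tkn columns and export the
# xp8 spreadsheet. File and column names are hypothetical stand-ins.
import pandas as pd

TEXT_COLUMN = "u"  # stand-in for the main text content ("column u")

def preprocess(text):
    # Placeholder for the xp8 pipeline (lemmatization + stopword removal);
    # see the simplify() sketch above for a fuller version.
    return str(text).lower().split()

df = pd.read_excel("oportunidades.xlsx")               # hypothetical input
df["opo_pre_tkn"] = df[TEXT_COLUMN].apply(preprocess)  # token lists
df["opo_pre"] = df["opo_pre_tkn"].str.join(" ")        # sentence form
df.to_excel("oportunidades_final_pre_processado.xlsx", index=False)
```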
 ### Pretraining
 