Update README.md
## Training data

The [inputted training data](https://github.com/chap0lin/PPF-MCTI/tree/master/Datasets) was obtained with scraping techniques applied to over 30 different platforms, e.g. The Royal Society and the Annenberg Foundation, and contained 928 labeled entries (928 rows x 21 columns). Of the data gathered, only the main text content (column u) was used. Text content averages 800 tokens in length, but with high variance, up to 5,000 tokens.
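Given the layout described above (928 rows x 21 columns, with the main text in column u, the 21st spreadsheet column), isolating that column might look like the following sketch. The two-row table here is a toy stand-in for the real dataset, and the column labels are assumptions:

```python
import csv
import io
import string

# Toy stand-in for the scraped spreadsheet: 21 columns labelled a..u,
# mirroring the 928 x 21 layout described above (only 2 rows here).
header = list(string.ascii_lowercase[:21])          # 'a' .. 'u'
rows = [
    header,
    [""] * 20 + ["First opportunity text ..."],
    [""] * 20 + ["Second opportunity text ..."],
]
buf = io.StringIO()
csv.writer(buf).writerows(rows)
buf.seek(0)

# Keep only the main text content (column u), discarding the other 20 fields.
texts = [row["u"] for row in csv.DictReader(buf)]
print(len(texts))   # one entry per labeled row
```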

The following assumptions were considered:

- Preprocessing experiments compare accuracy in a shallow neural network (SNN);
- Pre-processing was investigated for the classification goal.
From the database obtained in Goal 4, stored in the project's [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/scraps-desenvolvimento/Rotulagem/db_PPF_validacao_para%20UNB_%20FINAL.xlsx), a notebook was developed in [Google Colab](https://colab.research.google.com) to implement the [preprocessing code](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), which can also be found on the project's GitHub.

Several Python packages were used to develop the preprocessing code:

First, the treatment of punctuation and capitalization was evaluated. This phase resulted in the construction and evaluation of the first four bases (xp1, xp2, xp3, xp4).
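The exact combinations behind xp1–xp4 are not spelled out here, but the two treatments combine into four bases naturally. A minimal sketch, with the mapping of names to combinations assumed:

```python
import string

def lowercase(text: str) -> str:
    # Capitalization treatment: fold everything to lower case.
    return text.lower()

def strip_punctuation(text: str) -> str:
    # Punctuation treatment: drop all punctuation characters.
    return text.translate(str.maketrans("", "", string.punctuation))

raw = "Call for Proposals: apply NOW!"
# Four hypothetical bases from the two binary choices (assignment assumed):
xp1 = raw                                   # unchanged
xp2 = lowercase(raw)                        # lower case only
xp3 = strip_punctuation(raw)                # punctuation removed only
xp4 = strip_punctuation(lowercase(raw))     # both treatments
print(xp4)   # call for proposals apply now
```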
Then, content simplification was evaluated from the xp4 base, considering stemming (xp5), lemmatization (xp6), stemming + stopword removal (xp7), and lemmatization + stopword removal (xp8).
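The stemming and stopword-removal variants can be sketched as below. The suffix rules and stopword list here are illustrative toy stand-ins for a real stemmer and stopword list (e.g. NLTK's `SnowballStemmer` and its English stopwords); xp6 and xp8 would substitute a lemmatizer for the stemmer:

```python
# Toy stopword list; a real pipeline would use a library-provided one.
STOPWORDS = {"the", "of", "for", "and", "a", "is", "are"}

def toy_stem(word: str) -> str:
    # Crude suffix stripping, standing in for a real stemming algorithm.
    for suffix in ("ing", "ities", "ies", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def simplify(tokens, reduce, drop_stopwords):
    # Optionally drop stopwords, then reduce each remaining token.
    kept = [t for t in tokens if not (drop_stopwords and t in STOPWORDS)]
    return [reduce(t) for t in kept]

tokens = "the funding opportunities for researchers".split()
xp5 = simplify(tokens, toy_stem, drop_stopwords=False)  # stemming only
xp7 = simplify(tokens, toy_stem, drop_stopwords=True)   # stemming + stopword removal
print(xp7)   # ['fund', 'opportun', 'researcher']
```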
All eight bases were evaluated to classify the eligibility of the opportunity through the training of a shallow neural network (SNN). The metrics for the eight bases were evaluated. The results are

As a last step, a spreadsheet was generated for the model (xp8) with the fields opo_pre and opo_pre_tkn, containing the preprocessed text in sentence format and tokens, respectively. This [database](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/oportunidades_final_pre_processado.xlsx) was made available on the project's GitHub with the inclusion of the columns `opo_pre` (text) and `opo_pre_tkn` (tokenized).
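The relationship between the two exported fields can be illustrated as follows; the helper function is hypothetical, and the surrounding spreadsheet-writing code (e.g. writing the rows out to .xlsx) is omitted:

```python
def make_row(preprocessed_tokens):
    # One spreadsheet row: the same preprocessed document in both formats.
    return {
        "opo_pre": " ".join(preprocessed_tokens),   # sentence format
        "opo_pre_tkn": preprocessed_tokens,         # token list
    }

row = make_row(["fund", "opportun", "researcher"])
print(row["opo_pre"])   # fund opportun researcher
```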

### Pretraining