unb-lamfo-nlp-mcti/NLP-Classification-MCTI


MarcosDib committed on Dec 13, 2022

Commit f7c57c4 (1 parent: e12ab8c)

Update README.md


README.md +79 -46


@@ -19,7 +19,7 @@ thumbnail: https://github.com/Marcosdib/S2Query/Classification_Architecture_mode
 ![MCTIimg](https://antigo.mctic.gov.br/mctic/export/sites/institucional/institucional/entidadesVinculadas/conselhos/pag-old/RODAPE_MCTI.png)
-# MCTI Text Classification Task (uncased) DRAFT
 Disclaimer: The Brazilian Ministry of Science, Technology, and Innovation (MCTI) has partially supported this project.
@@ -38,24 +38,28 @@ Transformer-based approach, the Word2Vec-based approach improved the accuracy ra
 ## Model description
-Nullam congue hendrerit turpis et facilisis. Cras accumsan ante mi, eu hendrerit nulla finibus at. Donec imperdiet,
-nisi nec pulvinar suscipit, dolor nulla sagittis massa, et vehicula ante felis quis nibh. Lorem ipsum dolor sit amet,
-consectetur adipiscing elit. Maecenas viverra tempus risus non ornare. Donec in vehicula est. Pellentesque vulputate
-bibendum cursus. Nunc volutpat vitae neque ut bibendum:
-- Nullam congue hendrerit turpis et facilisis. Cras accumsan ante mi, eu hendrerit nulla finibus at. Donec imperdiet,
-  nisi nec pulvinar suscipit, dolor nulla sagittis massa, et vehicula ante felis quis nibh. Lorem ipsum dolor sit amet,
-  consectetur adipiscing elit.
-- Nullam congue hendrerit turpis et facilisis. Cras accumsan ante mi, eu hendrerit nulla finibus at. Donec imperdiet,
-  nisi nec pulvinar suscipit, dolor nulla sagittis massa, et vehicula ante felis quis nibh. Lorem ipsum dolor sit amet,
-  consectetur adipiscing elit.
-Nullam congue hendrerit turpis et facilisis. Cras accumsan ante mi, eu hendrerit nulla finibus at. Donec imperdiet,
-nisi nec pulvinar suscipit, dolor nulla sagittis massa, et vehicula ante felis quis nibh. Lorem ipsum dolor sit amet,
-consectetur adipiscing elit. Maecenas viverra tempus risus non ornare. Donec in vehicula est. Pellentesque vulputate
-bibendum cursus. Nunc volutpat vitae neque ut bibendum.
-![architeru](https://github.com/marcosdib/S2Query/Classification_Architecture_model.png)
 ## Model variations
@@ -74,30 +78,9 @@ Table 1: Templates using Word2Vec and Longformer
 | Longformer                   | 10.9GB  |
 | Word2Vec                     | 56.1MB  |
-| Keras Embedding + SNN	| 92.47	| 88.46	| 79.66	| 100	| 0.2	| 0.7	| 1.8	|
-| Keras Embedding + DNN	| 89.78	| 84.41	| 77.81	| 92.57	| 1	| 1.4	| 7.6	|
-| Keras Embedding + CNN	| 93.01	| 89.91	| 85.18	| 95.69	| 0.4	| 1.1	| 3.2	|
-| Keras Embedding + LSTM| 93.01	| 88.94	| 83.32	| 95.54	| 1.4	| 2	| 1.8	|
-| Word2Vec + SNN	| 89.25	| 83.82	| 74.15	| 97.10	| 1.4	| 1.2	| 9.6	|
-| Word2Vec + DNN	| 90.32	| 86.52	| 85.18	| 88.70	| 2	| 6.8	| 7.8	|
-| Word2Vec + CNN	| 92.47	| 88.42	| 80.85	| 98.72	| 1.9	| 3.4	| 4.7	|
-| Word2Vec + LSTM	| 89.78	| 84.36	| 75.36	| 95.81	| 2.6	| 14.3	| 1.2	|
-| Longformer + SNN	| 61.29	| 0	| 0	| 0	| 128	| 1.5	| 36.8	|
-| Longformer + DNN	| 91.93	| 87.62	| 80.37	| 97.62	| 81	| 8.4	| 12.7	|
-| Longformer + CNN	| 94.09	| 90.69	| 83.41	| 100	| 57	| 4.5	| 9.6	|
-| Longformer + LSTM	| 61.29	| 0	| 0	| 0	| 135	| 8.6	| 2.6	|
 ## Intended uses
-You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
-be fine-tuned on a downstream task. See the [model hub](https://www.google.com) to look for
-fine-tuned versions of a task that interests you.
-Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
-to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
-generation you should look at model like XXX.
 ### How to use
@@ -125,6 +108,15 @@ This model is uncased: it does not make a difference between english and English
 Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
 predictions:
 -
 -
 This bias will also affect all fine-tuned versions of this model.
@@ -144,14 +136,6 @@ it was coupled to the classification model to train it with the labeled data in
 obtained with related metrics. With this implementation, new levels of accuracy were reached: 86% for the CNN architecture
 and 88% for the LSTM architecture.
-Table 6: Results from Pre-trained WE + ML models
-| ML Model |  Accuracy | F1 Score  | Precision |   Recall  |
-|:--------:|:---------:|:---------:|:---------:|:---------:|
-| NN       |  0.8269   |  0.8545   |  0.8392   |  0.8712   |
-| DNN      |  0.7115   |  0.7794   |  0.7255   |  0.8485   |
-| CNN      |  0.8654   |  0.9083   |  0.8486   |  0.9773   |
-| LSTM     |  0.8846   |  0.9139   |  0.9056   |  0.9318   |
 ### Preprocessing
 Pre-processing was used to standardize the texts for the English language, reduce the number of insignificant tokens and
@@ -250,9 +234,58 @@ Table 5: Compatibility results (*base = labeled MCTI dataset entries)
 | BBC News Articles                    | 56.77%                 |
 | New unlabeled MCTI                   | 75.26%                 |
-## Evaluation results
 ## Benchmarks

 ![MCTIimg](https://antigo.mctic.gov.br/mctic/export/sites/institucional/institucional/entidadesVinculadas/conselhos/pag-old/RODAPE_MCTI.png)
+# MCTI Text Classification Task (uncased)
 Disclaimer: The Brazilian Ministry of Science, Technology, and Innovation (MCTI) has partially supported this project.
 ## Model description
+After the embedding step, which is essentially data preprocessing, the project needs a model that analyzes
+the input text and classifies whether or not it is a valid research funding opportunity for Brazilians.
+The best option is chosen empirically by comparing the results of four distinct architectures:
+Neural Network (NN), Deep Neural Network (DNN), Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN).
+Figure 4 shows the structure of the models.
+A neural network (NN) here is a simple feedforward network with a single hidden layer, usually called
+"shallow". Shallow NNs are often limited in the complexity of the problems they can be trained to solve well.
+Our CNN model is composed of a 50% dropout layer followed by two Conv1D layers and a MaxPooling layer.
+After max pooling, a dense layer of size 128 feeds a second 50% dropout layer, which connects to a Flatten
+layer and the final dense classification layer. Dropout layers help to avoid overfitting by masking part
+of the data, so that the network learns to create redundancies in its analysis of the inputs.
+Figure 4: Classification models
+![CNN Classification Model](https://raw.githubusercontent.com/chap0lin/WEBIST2022/master/Assets/cnn_model.png)
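The layer order described for the CNN can be traced with a small shape calculation. This is only a sketch in plain Python, not the project's code: the sequence length, embedding size, `filters`, `kernel_size`, `pool_size`, and the single-unit output layer are all assumed values, since the README does not state them. Only the layer order, the 50% dropout rates, and the dense size of 128 come from the text.

```python
def trace_cnn_shapes(seq_len, emb_dim, filters=64, kernel_size=3, pool_size=2):
    """Return (layer name, output shape) pairs for the described CNN stack.

    filters, kernel_size and pool_size are hypothetical; convolutions are
    assumed to use stride 1 with no padding, and pooling a stride equal to
    pool_size.
    """
    shapes = []
    steps, channels = seq_len, emb_dim

    # Dropout never changes the shape of its input.
    shapes.append(("Dropout(0.5)", (steps, channels)))

    # Two 1D convolutions: each shortens the sequence by kernel_size - 1.
    for i in (1, 2):
        steps = steps - kernel_size + 1
        channels = filters
        shapes.append((f"Conv1D #{i}", (steps, channels)))

    # Max pooling over non-overlapping windows.
    steps = (steps - pool_size) // pool_size + 1
    shapes.append(("MaxPooling1D", (steps, channels)))

    # A dense layer on a sequence acts on the last axis only.
    channels = 128
    shapes.append(("Dense(128)", (steps, channels)))
    shapes.append(("Dropout(0.5)", (steps, channels)))

    # Flatten, then a final dense classification layer (one unit assumed).
    shapes.append(("Flatten", (steps * channels,)))
    shapes.append(("Dense(1)", (1,)))
    return shapes


for name, shape in trace_cnn_shapes(seq_len=100, emb_dim=50):
    print(f"{name:<14} {shape}")
```

Because the Flatten layer comes after the dense layer of size 128, the input to the final classifier grows with the sequence length, which is worth keeping in mind when choosing the input size.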
 ## Model variations
 | Longformer                   | 10.9GB  |
 | Word2Vec                     | 56.1MB  |
 ## Intended uses
 ### How to use
 Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
 predictions:
+Performance limitation: Loading the Longformer model in memory requires 11 GB for the model alone,
+without considering the weight of the deep learning network. For training, this means we need a GPU with
+20+ GB of memory. Here this was resolved by using the high-RAM environment of Google Colab Pro and training
+on CPU, which explains the longer training time per epoch.
+Replicability limitation: Due to the simplicity of the Keras embedding model, we are using one-hot encoding,
+which is difficult to replicate in production. This detail is pending further study to define
+whether it is possible to use one of these models.
 -
 -
 This bias will also affect all fine-tuned versions of this model.
 obtained with related metrics. With this implementation, new levels of accuracy were reached: 86% for the CNN architecture
 and 88% for the LSTM architecture.
 ### Preprocessing
 Pre-processing was used to standardize the texts for the English language, reduce the number of insignificant tokens and
 | BBC News Articles                    | 56.77%                 |
 | New unlabeled MCTI                   | 75.26%                 |
+Table 6: Results from Pre-trained WE + ML models
+| ML Model |  Accuracy | F1 Score  | Precision |   Recall  |
+|:--------:|:---------:|:---------:|:---------:|:---------:|
+| NN       |  0.8269   |  0.8545   |  0.8392   |  0.8712   |
+| DNN      |  0.7115   |  0.7794   |  0.7255   |  0.8485   |
+| CNN      |  0.8654   |  0.9083   |  0.8486   |  0.9773   |
+| LSTM     |  0.8846   |  0.9139   |  0.9056   |  0.9318   |
+## Evaluation results
+The table below presents the accuracy, F1-score, recall, and precision obtained when training each network.
+It also lists the time needed to train each epoch, the validation execution time, and the weight of the deep
+learning model associated with each implementation.
+Table 7: Results of experiments
+| Model                  | Accuracy | F1-score | Recall | Precision | Training time per epoch (s) | Validation time (s) | Weight (MB) |
+|------------------------|----------|----------|--------|-----------|------------------------|---------------------|------------|
+| Keras Embedding + SNN  |    92.47 |    88.46 |  79.66 |    100.00 |                    0.2 |                 0.7 |        1.8 |
+| Keras Embedding + DNN  |    89.78 |    84.41 |  77.81 |     92.57 |                    1.0 |                 1.4 |        7.6 |
+| Keras Embedding + CNN  |    93.01 |    89.91 |  85.18 |     95.69 |                    0.4 |                 1.1 |        3.2 |
+| Keras Embedding + LSTM |    93.01 |    88.94 |  83.32 |     95.54 |                    1.4 |                 2.0 |        1.8 |
+| Word2Vec + SNN         |    89.25 |    83.82 |  74.15 |     97.10 |                    1.4 |                 1.2 |        9.6 |
+| Word2Vec + DNN         |    90.32 |    86.52 |  85.18 |     88.70 |                    2.0 |                 6.8 |        7.8 |
+| Word2Vec + CNN         |    92.47 |    88.42 |  80.85 |     98.72 |                    1.9 |                 3.4 |        4.7 |
+| Word2Vec + LSTM        |    89.78 |    84.36 |  75.36 |     95.81 |                    2.6 |                14.3 |        1.2 |
+| Longformer + SNN       |    61.29 |        0 |      0 |         0 |                  128.0 |                 1.5 |       36.8 |
+| Longformer + DNN       |    91.93 |    87.62 |  80.37 |     97.62 |                   81.0 |                 8.4 |       12.7 |
+| Longformer + CNN       |    94.09 |    90.69 |  83.41 |    100.00 |                   57.0 |                 4.5 |        9.6 |
+| Longformer + LSTM      |    61.29 |        0 |      0 |         0 |                   13.0 |                 8.6 |        2.6 |
+The results obtained surpassed those achieved in goal 6 and goal 9, with the best accuracy, 94%, obtained
+by the Longformer + CNN model. We can also observe that the models that achieved the best results were those
+that used the CNN network for deep learning.
+In addition, the Longformer + SNN and Longformer + LSTM models were not able
+to learn. These models may need some adjustments, but each training attempt took between 5 and 8 hours, which
+made further tuning impractical when other models were already showing promising results.
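The two observations above can be read directly off Table 7. The sketch below copies the accuracy and F1-score columns from the table and picks out the most accurate model and the models whose F1-score is 0; the function names are illustrative only and are not part of the project.

```python
# (model, accuracy %, F1-score %) as reported in Table 7.
RESULTS = [
    ("Keras Embedding + SNN", 92.47, 88.46),
    ("Keras Embedding + DNN", 89.78, 84.41),
    ("Keras Embedding + CNN", 93.01, 89.91),
    ("Keras Embedding + LSTM", 93.01, 88.94),
    ("Word2Vec + SNN", 89.25, 83.82),
    ("Word2Vec + DNN", 90.32, 86.52),
    ("Word2Vec + CNN", 92.47, 88.42),
    ("Word2Vec + LSTM", 89.78, 84.36),
    ("Longformer + SNN", 61.29, 0.0),
    ("Longformer + DNN", 91.93, 87.62),
    ("Longformer + CNN", 94.09, 90.69),
    ("Longformer + LSTM", 61.29, 0.0),
]


def best_by_accuracy(rows):
    """Return the row with the highest accuracy."""
    return max(rows, key=lambda row: row[1])


def failed_to_learn(rows):
    """Return the names of models with an F1-score of 0."""
    return [name for name, _, f1 in rows if f1 == 0]


print(best_by_accuracy(RESULTS))
print(failed_to_learn(RESULTS))
```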
+Beyond the results obtained, it is also necessary to highlight two limitations found for the replication and
+training of the networks:
+The roughly 10 GB model exceeds the GitHub limit and was not pushed to the repository, so to run the system we need
+to download the pre-trained network in the notebook and run the encoder-decoder with the data to create the model.
+It is advisable to do this in a GPU environment and save the file on the drive. After that, change the environment to
+CPU to perform the training. Trying to generate the model on CPU will take more than 3 hours of processing.
+The best model that does not have any limitations is Word2Vec + CNN. However, we need to study the limitations to
+understand whether it is possible to introduce a new model with better accuracy and indicators. These adjustments
+will be worked on during goals 13 and 14, where the main objective will be to encapsulate the solution in the most
+suitable way for use in production.
 ## Benchmarks