---
license: apache-2.0
datasets:
- AiresPucrs/sentiment-analysis
language:
- en
metrics:
- accuracy
library_name: keras
---

# english-embedding-vocabulary-16

## Model Overview

`english-embedding-vocabulary-16` is a small Keras model with a 16-dimensional word-embedding layer, trained for sentiment analysis on English text.

### Details

- **Size:** 160,289 parameters
- **Model type:** word embeddings
- **Optimizer:** Adam
- **Number of Epochs:** 20
- **Embedding size:** 16
- **Hardware:** Tesla V4
- **Emissions:** Not measured
- **Total Energy Consumption:** Not measured
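
The parameter count listed above can be sanity-checked with a little arithmetic. The sketch below rests on assumptions, not on a documented architecture: it supposes a vocabulary of 10,000 tokens feeding the 16-dimensional embedding, a pooling step with no weights, one 16-unit dense layer, and a single-unit output. The names `vocab_size` and `hidden_units` and the layer layout are hypothetical; they are simply one configuration that adds up to 160,289.

```python
# Hypothetical layer layout (an assumption, not confirmed by this model card).
vocab_size = 10_000     # assumed number of tokens in the vocabulary
embedding_dim = 16      # "Embedding size" from the details list
hidden_units = 16       # assumed width of a single hidden dense layer

embedding_params = vocab_size * embedding_dim                # lookup table, no bias
hidden_params = embedding_dim * hidden_units + hidden_units  # weights + biases
output_params = hidden_units * 1 + 1                         # one output unit + bias

total_params = embedding_params + hidden_params + output_params
print(total_params)  # 160289
```

After loading the model, `model.summary()` reports the actual layers and their parameter counts, and should be trusted over this estimate.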

### How to Use

The following snippet downloads the model and its vocabulary file, then extracts the learned word embeddings:

```python
import numpy as np
import tensorflow as tf
from huggingface_hub import hf_hub_download

# Download the model
model_path = hf_hub_download(
    repo_id="AiresPucrs/english-embedding-vocabulary-16",
    filename="english_embedding_vocabulary_16.keras",
    local_dir="./",
    repo_type="model",
)

# Download the embedding vocabulary txt file
vocabulary_path = hf_hub_download(
    repo_id="AiresPucrs/english-embedding-vocabulary-16",
    filename="english_embedding_vocabulary.txt",
    local_dir="./",
    repo_type="model",
)

model = tf.keras.models.load_model(model_path)

# Compile the model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Read the vocabulary, one token per line (the `with` block closes the file)
with open(vocabulary_path, encoding='utf-8') as fp:
    english_embedding_vocabulary = [line.strip() for line in fp]

# Weight matrix of the embedding layer: one row per vocabulary entry
embeddings = model.get_layer('embedding').get_weights()[0]

words_embeddings = {}

# Map each word to its embedding vector
for i, word in enumerate(english_embedding_vocabulary):
    # Skip embedding/token 0 (""), because it is just the PAD token.
    if i == 0:
        continue
    words_embeddings[word] = embeddings[i]

print("Embeddings Dimensions: ", np.array(list(words_embeddings.values())).shape)
print("Vocabulary Size: ", len(words_embeddings.keys()))
```
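
Once `words_embeddings` is built, one common way to explore it is to look up a word's nearest neighbours by cosine similarity. The following is a minimal sketch under stated assumptions: the helper `nearest_words` is hypothetical (it is not part of this repository), and the small hand-made dictionary only stands in for the real `words_embeddings` so that the example runs without downloading the model.

```python
import numpy as np


def nearest_words(query, words_embeddings, top_k=3):
    """Return the top_k words most similar to `query` by cosine similarity."""
    words = [w for w in words_embeddings if w != query]
    matrix = np.stack([words_embeddings[w] for w in words])
    vector = words_embeddings[query]
    # Cosine similarity: dot product divided by the product of the norms.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    similarities = matrix @ vector / np.maximum(norms, 1e-12)
    best = np.argsort(-similarities)[:top_k]
    return [(words[i], float(similarities[i])) for i in best]


# Toy stand-in for the real dictionary (made-up 3-dimensional vectors).
toy_embeddings = {
    "good": np.array([1.0, 0.2, 0.0]),
    "great": np.array([0.9, 0.3, 0.1]),
    "bad": np.array([-1.0, 0.1, 0.0]),
    "awful": np.array([-0.9, 0.2, 0.1]),
}

neighbours = nearest_words("good", toy_embeddings, top_k=2)
print(neighbours)
```

With the real data, pass the `words_embeddings` dictionary built in the snippet above. Because these embeddings were learned for sentiment classification, neighbours may reflect sentiment polarity more than general meaning.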

## Intended Use

This model was created for research purposes only. We do not recommend any application of this model outside this scope.

## Performance Metrics

The model achieved an accuracy of 84% on validation data.

## Training Data

The model was trained on a dataset built by combining several sentiment classification datasets available on [Kaggle](https://www.kaggle.com/):

- The `IMDB 50K` [dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv): _50K movie reviews for natural language processing or text analytics._
- The `Twitter US Airline Sentiment` [dataset](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment): _originated from [Crowdflower's Data for Everyone library](http://www.crowdflower.com/data-for-everyone)._
- Our `google_play_apps_review` dataset: _built using the `google_play_scraper` in [this notebook](https://github.com/Nkluge-correa/teeny-tiny_castle/blob/master/ML%20Explainability/NLP%20Interpreter%20(en)/scrape(en).ipynb)._
- The `EcoPreprocessed` [dataset](https://www.kaggle.com/datasets/pradeeshprabhakar/preprocessed-dataset-sentiment-analysis): _scraped Amazon product reviews._

## Limitations

We do not recommend using this model in real-world applications. It was developed solely for academic and educational purposes.

## Cite as

```latex
@misc{teenytinycastle,
    doi = {10.5281/zenodo.7112065},
    url = {https://github.com/Nkluge-correa/teeny-tiny_castle},
    author = {Nicholas Kluge Corr{\^e}a},
    title = {Teeny-Tiny Castle},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
}
```

## License

This model is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.