	---
	license: apache-2.0
	datasets:
	- AiresPucrs/sentiment-analysis
	language:
	- en
	metrics:
	- accuracy
	library_name: keras
	---
	# english-embedding-vocabulary-16

	## Model Overview

	The english-embedding-vocabulary-16 is a small Keras model trained for sentiment analysis. Its learned 16-dimensional word embeddings can be extracted and reused.

	### Details

	- Size: 160,289 parameters
	- Model type: word embeddings
	- Optimizer: Adam
	- Number of Epochs: 20
	- Embedding size: 16
	- Hardware: Tesla V4
	- Emissions: Not measured
	- Total Energy Consumption: Not measured
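
The parameter count above can be sanity-checked with simple arithmetic. This card does not state the vocabulary size or the layers that follow the embedding, so the layout below (a 10,000-token vocabulary, one 16-unit dense layer, and a single-unit output) is purely an assumption, chosen because it happens to sum to 160,289. The actual architecture may differ.

```python
def embedding_params(vocab_size, dim):
    """An embedding layer stores one `dim`-sized vector per token."""
    return vocab_size * dim


def dense_params(n_in, n_out):
    """A dense layer has an (n_in x n_out) weight matrix plus n_out biases."""
    return n_in * n_out + n_out


VOCAB_SIZE = 10_000   # assumption: not stated in this card
EMBED_DIM = 16        # from the Details list
HIDDEN_UNITS = 16     # assumption: hypothetical hidden layer width

total = (
    embedding_params(VOCAB_SIZE, EMBED_DIM)
    + dense_params(EMBED_DIM, HIDDEN_UNITS)
    + dense_params(HIDDEN_UNITS, 1)  # hypothetical single-unit output
)
print("Total parameters under these assumptions:", total)
```

After loading the model, `model.summary()` reports the real layer-by-layer breakdown.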

	### How to Use

	To load the model and extract its word embeddings, you can use the following code snippet:

	```python
	import numpy as np
	import tensorflow as tf
	from huggingface_hub import hf_hub_download

	# Download the model
	hf_hub_download(
	    repo_id="AiresPucrs/english-embedding-vocabulary-16",
	    filename="english_embedding_vocabulary_16.keras",
	    local_dir="./",
	    repo_type="model",
	)

	# Download the embedding vocabulary txt file
	hf_hub_download(
	    repo_id="AiresPucrs/english-embedding-vocabulary-16",
	    filename="english_embedding_vocabulary.txt",
	    local_dir="./",
	    repo_type="model",
	)

	model = tf.keras.models.load_model('english_embedding_vocabulary_16.keras')

	# Compile the model
	model.compile(loss='binary_crossentropy',
	              optimizer='adam',
	              metrics=['accuracy'])

	# Read the vocabulary, one token per line (the `with` block closes the file)
	with open('english_embedding_vocabulary.txt', encoding='utf-8') as fp:
	    english_embedding_vocabulary = [line.strip() for line in fp]

	embeddings = model.get_layer('embedding').get_weights()[0]

	words_embeddings = {}

	# Map each vocabulary word to its embedding vector
	for i, word in enumerate(english_embedding_vocabulary):
	    # Skip index 0 (""), which is just the PAD token
	    if i == 0:
	        continue
	    words_embeddings[word] = embeddings[i]

	print("Embeddings Dimensions: ", np.array(list(words_embeddings.values())).shape)
	print("Vocabulary Size: ", len(words_embeddings.keys()))
	```
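
Once the `words_embeddings` dictionary has been built, one common way to explore it is a cosine-similarity nearest-neighbour lookup. The following is a minimal sketch, not part of the original model card: `most_similar` is a hypothetical helper, and the toy vectors stand in for the real embeddings so the block runs without downloading the model.

```python
import numpy as np


def most_similar(words_embeddings, query, top_k=3):
    """Return the `top_k` words closest to `query` by cosine similarity."""
    words = [w for w in words_embeddings if w != query]
    matrix = np.stack([words_embeddings[w] for w in words])
    q = words_embeddings[query]
    # Cosine similarity: dot product divided by the product of the norms.
    # The small constant guards against division by zero.
    sims = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q) + 1e-12)
    order = np.argsort(-sims)[:top_k]
    return [(words[i], float(sims[i])) for i in order]


# Toy 16-dimensional vectors (assumption: stand-ins for the real embeddings)
rng = np.random.default_rng(0)
base = rng.normal(size=16)
toy_embeddings = {
    "good": base,
    "great": base + 0.01 * rng.normal(size=16),  # nearly parallel to "good"
    "bad": -base,                                # opposite direction
}

print(most_similar(toy_embeddings, "good", top_k=2))
```

With the real embeddings, pass `words_embeddings` in place of `toy_embeddings`. Embeddings trained on a sentiment objective tend to group words by polarity rather than by general meaning, so neighbours may look different from those of general-purpose word vectors.
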
	## Intended Use

	This model was created for research purposes only. We do not recommend any application of this model outside this scope.

	## Performance Metrics

	The model achieved an accuracy of 84% on validation data.
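
For reference, accuracy for a binary classifier is the fraction of examples whose thresholded prediction matches the label. The sketch below assumes a sigmoid-style score in [0, 1] and a 0.5 decision threshold, neither of which is stated in this card. The labels and scores are made-up illustrations, not the model's validation data.

```python
import numpy as np


def binary_accuracy(y_true, y_score, threshold=0.5):
    """Fraction of examples where the thresholded score equals the label."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) > threshold).astype(int)
    return float(np.mean(y_pred == y_true))


# Made-up labels and scores, for illustration only
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.4, 0.7, 0.6]

print("Accuracy:", binary_accuracy(labels, scores))
```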

	## Training Data

	The model was trained on a dataset assembled from several sentiment-classification datasets available on [Kaggle](https://www.kaggle.com/):

	- The `IMDB 50K` [dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv): _50K movie reviews for natural language processing or text analytics._
	- The `Twitter US Airline Sentiment` [dataset](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment): _originally from [Crowdflower's Data for Everyone library](http://www.crowdflower.com/data-for-everyone)._
	- Our `google_play_apps_review` dataset: _built using the `google_play_scraper` in [this notebook](https://github.com/Nkluge-correa/teeny-tiny_castle/blob/master/ML%20Explainability/NLP%20Interpreter%20(en)/scrape(en).ipynb)._
	- The `EcoPreprocessed` [dataset](https://www.kaggle.com/datasets/pradeeshprabhakar/preprocessed-dataset-sentiment-analysis): _scraped Amazon product reviews._

	## Limitations

	We do not recommend using this model in real-world applications. It was solely developed for academic and educational purposes.

	## Cite as

	```bibtex
	@misc{teenytinycastle,
	doi = {10.5281/zenodo.7112065},
	url = {https://github.com/Nkluge-correa/teeny-tiny_castle},
	author = {Nicholas Kluge Corr{\^e}a},
	title = {Teeny-Tiny Castle},
	year = {2024},
	publisher = {GitHub},
	journal = {GitHub repository},
	}
	```

	## License

	This model is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.