---
license: mit
datasets:
- yelp_review_full
language:
- en
metrics:
- accuracy
- f1
library_name: transformers
---
# Model Card

## Sentiment Analysis of Restaurant Reviews from Yelp Dataset

### Overview

- **Task**: Sentiment classification of restaurant reviews from the Yelp dataset.
- **Model**: Fine-tuned BERT (Bidirectional Encoder Representations from Transformers) for sequence classification.
- **Training Dataset**: Yelp dataset containing restaurant reviews.
- **Training Framework**: PyTorch and the Transformers library.

### Model Details

- **Pre-trained Model**: BERT-base-uncased.
- **Input**: Cleaned and preprocessed restaurant reviews.
- **Output**: Binary classification (positive or negative sentiment).
- **Tokenization**: BERT tokenizer with a maximum sequence length of 240 tokens.
- **Optimizer**: AdamW with a learning rate of 3e-5.
- **Learning Rate Scheduler**: Linear scheduler with no warmup steps.
- **Loss Function**: CrossEntropyLoss.
- **Batch Size**: 16.
- **Number of Epochs**: 2.
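
The hyperparameters above can be sketched as a minimal setup. This is illustrative, not the original training script: the dummy parameter stands in for BERT's weights so the snippet runs without downloading the model (the real model would come from `BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)`), and `steps_per_epoch` is an assumed value that in practice depends on the dataset size.

```python
# Minimal sketch of the optimisation setup listed above. The dummy
# parameter stands in for BERT's weights; steps_per_epoch is illustrative.
import torch
from transformers import get_linear_schedule_with_warmup

MAX_LEN = 240     # maximum sequence length for the BERT tokenizer
BATCH_SIZE = 16
EPOCHS = 2
LR = 3e-5

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=LR)

steps_per_epoch = 100  # in practice: len(train_dataloader)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,  # no warmup, as noted above
    num_training_steps=steps_per_epoch * EPOCHS,
)

loss_fn = torch.nn.CrossEntropyLoss()
```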

### Data Preprocessing

1. Loaded the Yelp reviews dataset and business dataset.
2. Merged the datasets on the "business_id" column.
3. Removed unnecessary columns and duplicates.
4. Translated star ratings into binary sentiment labels (positive or negative).
5. Upsampled the minority class (negative sentiment) to address imbalanced data.
6. Cleaned the text by removing non-letter characters, converting to lowercase, and tokenizing.
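
Steps 2, 4, 5, and 6 above could look roughly like the sketch below, assuming pandas and scikit-learn. The helper names are invented for illustration, and the star-to-label cutoff is an assumption (this card does not state where the boundary lies), not the original preprocessing code.

```python
# Illustrative sketch of the preprocessing steps (pandas / scikit-learn).
# Helper names and the star-rating cutoff are assumptions.
import re

import pandas as pd
from sklearn.utils import resample


def merge_reviews(reviews: pd.DataFrame, business: pd.DataFrame) -> pd.DataFrame:
    """Step 2: join reviews with business metadata on business_id."""
    return reviews.merge(business, on="business_id")


def stars_to_label(stars: float) -> int:
    """Step 4: assumed mapping -- 4-5 stars positive (1), otherwise negative (0)."""
    return 1 if stars >= 4 else 0


def upsample_minority(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Step 5: resample the minority class up to the majority-class count."""
    counts = df[label_col].value_counts()
    minority, majority = counts.idxmin(), counts.idxmax()
    upsampled = resample(
        df[df[label_col] == minority],
        replace=True,
        n_samples=int(counts[majority]),
        random_state=42,
    )
    return pd.concat([df[df[label_col] == majority], upsampled], ignore_index=True)


def clean_text(text: str) -> str:
    """Step 6: keep letters only, lowercase, collapse whitespace."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return " ".join(letters_only.lower().split())
```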
43
+
44
+ ### Model Training
45
+
46
+ 1. Split the dataset into training (70%), validation (15%), and test (15%) sets.
47
+ 2. Tokenized, padded, and truncated input sequences.
48
+ 3. Created attention masks to differentiate real tokens from padding.
49
+ 4. Fine-tuned BERT using the specified hyperparameters.
50
+ 5. Tracked training and validation accuracy and loss for each epoch.
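
The split proportions (step 1) and attention-mask construction (step 3) can be sketched as follows. The padded ids are hand-made stand-ins; in the real pipeline they come from the bert-base-uncased tokenizer, whose `[PAD]` token id is 0.

```python
# Sketch of the 70/15/15 split (step 1) and attention masks (step 3).
# The padded ids below are stand-ins for real tokenizer output.
import torch
from sklearn.model_selection import train_test_split

indices = list(range(100))
# 70% train, then split the remaining 30% evenly into validation and test.
train_idx, rest = train_test_split(indices, test_size=0.30, random_state=42)
val_idx, test_idx = train_test_split(rest, test_size=0.50, random_state=42)

PAD_ID = 0  # BERT's [PAD] token id


def attention_masks(input_ids: torch.Tensor) -> torch.Tensor:
    """1 for real tokens, 0 for padding, as BERT expects."""
    return (input_ids != PAD_ID).long()


batch = torch.tensor([[101, 2307, 102, 0, 0],
                      [101, 6659, 2326, 102, 0]])
masks = attention_masks(batch)
```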
51
+
52
+ ### Model Evaluation
53
+
54
+ 1. Achieved high accuracy and F1 scores on both the validation and test sets.
55
+ 2. Generalization observed, as the accuracy on the test set was similar to the validation set.
56
+ 3. The model showed improvement in validation loss, indicating no overfitting.
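
The reported metrics correspond to scikit-learn's `accuracy_score` and `f1_score`; a toy example with invented labels (not the actual test-set predictions) shows how they are computed:

```python
# Toy example of the evaluation metrics (illustrative labels only).
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)  # 4 of 5 correct -> 0.8
f1 = f1_score(y_true, y_pred)         # precision 1.0, recall 2/3 -> 0.8
```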
57
+
58
+ ### Model Deployment
59
+
60
+ 1. Saved the trained model and tokenizer.
61
+ 2. Published the model and tokenizer to the Hugging Face Model Hub.
62
+ 3. Demonstrated how to load and use the model for making predictions.
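
A downstream user could load the published checkpoint roughly as below. The Hub repository id is left as a placeholder (this card does not state it), and `to_sentiment` is an assumed helper for the pipeline's default `LABEL_0`/`LABEL_1` outputs.

```python
# Sketch of loading the published model from the Hub. The repository id
# is a placeholder; to_sentiment is an assumed helper.
from transformers import pipeline


def to_sentiment(result: dict) -> str:
    """Map a pipeline result like {'label': 'LABEL_1', 'score': ...} to a name."""
    return "positive" if result["label"].endswith("1") else "negative"


def classify(texts, model_id="<username>/<model-name>"):
    """Run the published classifier over a list of review texts."""
    clf = pipeline("text-classification", model=model_id)
    return [to_sentiment(r) for r in clf(texts)]
```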
63
+
64
+ ### Model Performance
65
+
66
+ - **Validation Accuracy**: ≈ 97.5% - 97.8%
67
+ - **Test Accuracy**: ≈ 97.8%
68
+ - **F1 Score**: ≈ 97.8% - 97.9%
69
+
70
+ ### Limitations
71
+
72
+ - Excluding stopwords may impact contextual understanding, but it was necessary to handle token length limitations.
73
+ - Performance may vary on reviews in languages other than English.
74
+
75
+ ### Conclusion
76
+
77
+ The fine-tuned BERT model demonstrates robust sentiment analysis on Yelp restaurant reviews. Its high accuracy and F1 scores indicate effectiveness in capturing sentiment from user-generated content. The model is suitable for deployment in applications requiring sentiment classification for restaurant reviews.