metadata
language: es
tags:
  - sagemaker
  - beto
  - TextClassification
  - SentimentAnalysis
license: apache-2.0
datasets:
  - IMDbreviews_es
metrics:
  - accuracy
model-index:
  - name: beto_sentiment_analysis_es
    results:
      - task:
          name: Sentiment Analysis
          type: sentiment-analysis
        dataset:
          name: IMDb Reviews in Spanish
          type: IMDbreviews_es
        metrics:
          - name: Accuracy
            type: accuracy
            value: 0.9101333333333333
          - name: F1 Score
            type: f1
            value: 0.9088450094671354
          - name: Precision
            type: precision
            value: 0.9105691056910569
          - name: Recall
            type: recall
            value: 0.9071274298056156
widget:
  - text: >-
      Se trata de una película interesante, con un solido argumento y un gran
      interpretación de su actor principal

Model beto_sentiment_analysis_es

A fine-tuned model for sentiment analysis in Spanish

This model was trained using Amazon SageMaker and the new Hugging Face Deep Learning Container. The base model is BETO, a BERT-base model pre-trained on a Spanish corpus. BETO is similar in size to BERT-Base and was trained with the Whole Word Masking technique.

BETO Citation

Spanish Pre-Trained BERT Model and Evaluation Data

@inproceedings{CaneteCFP2020,
  title={Spanish Pre-Trained BERT Model and Evaluation Data},
  author={Cañete, José and Chaperon, Gabriel and Fuentes, Rodrigo and Ho, Jou-Hui and Kang, Hojin and Pérez, Jorge},
  booktitle={PML4DC at ICLR 2020},
  year={2020}
}

Dataset

The dataset is a collection of about 50,000 movie reviews in Spanish. It is balanced and provides every review in English and in Spanish, along with the label in both languages.

Sizes of datasets:

  • Train dataset: 42,500
  • Validation dataset: 3,750
  • Test dataset: 3,750
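
The split sizes listed above add up to the full corpus of about 50,000 reviews, which works out to an 85% / 7.5% / 7.5% partition. A quick arithmetic check, using only the numbers from this card:

```python
# Split sizes as listed in this model card
splits = {"train": 42_500, "validation": 3_750, "test": 3_750}

total = sum(splits.values())
# Fraction of the full dataset that falls in each split
fractions = {name: size / total for name, size in splits.items()}

print(total)      # 50000
print(fractions)  # {'train': 0.85, 'validation': 0.075, 'test': 0.075}
```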

Intended uses & limitations

This model is intended for sentiment analysis of Spanish text. It was fine-tuned specifically on movie reviews, but it can be applied to other kinds of reviews.

Hyperparameters

{
"epochs": "4",
"train_batch_size": "32",
"eval_batch_size": "8",
"fp16": "true",
"learning_rate": "3e-05",
"model_name": "\"dccuchile/bert-base-spanish-wwm-uncased\"",
"sagemaker_container_log_level": "20",
"sagemaker_program": "\"train.py\""
}
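
Every value above is serialized as a string because SageMaker hands hyperparameters to the training script as command-line arguments. The sketch below shows one way a script such as train.py could parse them back into typed values with argparse. It is an illustration under that assumption, not the actual train.py used for this model, and the `parse_hyperparameters` helper is a hypothetical name.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse string-valued hyperparameters into typed Python values (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=4)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=8)
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    # argparse has no built-in boolean parsing, so map the string explicitly
    parser.add_argument("--fp16", type=lambda s: s.lower() == "true", default=True)
    parser.add_argument("--model_name", type=str,
                        default="dccuchile/bert-base-spanish-wwm-uncased")
    # Ignore any extra arguments the platform may append
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters([
    "--epochs", "4", "--train_batch_size", "32", "--eval_batch_size", "8",
    "--learning_rate", "3e-05", "--fp16", "true",
])
print(args.epochs, args.learning_rate, args.fp16)
```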

Evaluation results

  • Accuracy = 0.9101333333333333

  • F1 Score = 0.9088450094671354

  • Precision = 0.9105691056910569

  • Recall = 0.9071274298056156
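
As a consistency check, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two values reported above. A minimal sketch:

```python
# Precision and recall as reported in this model card
precision = 0.9105691056910569
recall = 0.9071274298056156

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.9088
```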

Test results

Model in action

Usage for Sentiment Analysis

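import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("edumunozsala/beto_sentiment_analysis_es")
model = AutoModelForSequenceClassification.from_pretrained("edumunozsala/beto_sentiment_analysis_es")

text = "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"

# Tokenize into a batch of one, truncating to the model's maximum input length
inputs = tokenizer(text, return_tensors="pt", truncation=True)
# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)
# Index of the highest-scoring class
output = outputs.logits.argmax(dim=1)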
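
The `output` computed above is only a class index. To attach a confidence score to it, the logits can be passed through a softmax. The sketch below uses plain Python and made-up logit values for a two-class case, which is an assumption for illustration. With the real model, the logits would come from `outputs.logits`, and the index-to-label mapping could be read from `model.config.id2label`, whose contents are not documented in this card.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one review (illustrative values, not real model output)
logits = [-1.2, 2.3]
probs = softmax(logits)
# Index of the most probable class, equivalent to argmax over the logits
predicted = max(range(len(probs)), key=probs.__getitem__)

print(predicted, round(probs[predicted], 3))
```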

Created by Eduardo Muñoz/@edumunozsala