This is a RuBERT model fine-tuned for sentiment classification of short Russian texts. The task is multi-class classification with the following labels:
0: neutral
1: positive
2: negative
English labels and their Russian equivalents:
neutral: нейтральный
positive: позитивный
negative: негативный
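The two mappings above can be kept as plain dictionaries. The following is a minimal sketch, not part of the model's own code: the names `ID2LABEL`, `LABEL2RU`, and `to_russian` are hypothetical helpers introduced here for illustration, and the input is assumed to be a dict shaped like the pipeline output shown under Usage.

```python
# Class ids and label names as listed in this model card.
ID2LABEL = {0: "neutral", 1: "positive", 2: "negative"}

# English label -> Russian label, as listed in this model card.
LABEL2RU = {
    "neutral": "нейтральный",
    "positive": "позитивный",
    "negative": "негативный",
}


def to_russian(prediction):
    """Return a copy of a prediction dict with an extra 'label_ru' key.

    `prediction` is assumed to look like {'label': 'positive', 'score': 0.98}.
    """
    return {**prediction, "label_ru": LABEL2RU[prediction["label"]]}


example = {"label": "positive", "score": 0.98}
print(to_russian(example))
```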
Usage
from transformers import pipeline
model = pipeline(model="seara/rubert-base-cased-russian-sentiment")
model("Привет, ты мне нравишься!")
# [{'label': 'positive', 'score': 0.9818321466445923}]
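A text-classification pipeline also accepts a list of strings and returns one prediction dict per input. The sketch below wraps that in a small helper; `load_classifier` and `classify_batch` are hypothetical names used only for this example, and the second sample sentence is made up for illustration.

```python
def load_classifier():
    """Build the pipeline for this model (downloads weights on first call)."""
    # Imported lazily so classify_batch can be used with any compatible callable.
    from transformers import pipeline

    return pipeline(model="seara/rubert-base-cased-russian-sentiment")


def classify_batch(classifier, texts):
    """Classify several texts and return (text, label, score) tuples.

    `classifier` is any callable that takes a list of strings and returns
    a list of {'label': ..., 'score': ...} dicts, as the pipeline does.
    """
    texts = list(texts)
    predictions = classifier(texts)
    return [(text, p["label"], p["score"]) for text, p in zip(texts, predictions)]


# Example usage (requires the transformers library and network access):
# classifier = load_classifier()
# for row in classify_batch(classifier, ["Привет, ты мне нравишься!", "Это ужасно."]):
#     print(row)
```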
Dataset
This model was trained on the union of the following datasets:
- Kaggle Russian News Dataset
- Linis Crowd 2015
- Linis Crowd 2016
- RuReviews
- RuSentiment
An overview of the training data can be found in S. Smetanin's GitHub repository.
Download links for all Russian sentiment datasets collected by Smetanin can be found in this repository.
Training
Training was done in this project with the following parameters:
tokenizer.max_length: 256
batch_size: 32
optimizer: adam
lr: 0.00001
weight_decay: 0
epochs: 2
Train/validation/test splits are 80%/10%/10%.
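An 80%/10%/10% split can be produced in several ways. The sketch below shows one simple option, a seeded shuffle followed by slicing. It is an assumption for illustration only: the model card does not say whether the original split was shuffled, stratified, or seeded, and `split_80_10_10` is a hypothetical helper name.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle items with a fixed seed and cut them into 80/10/10 parts."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seed value is an arbitrary choice
    n = len(items)
    n_train = n * 8 // 10
    n_val = n // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


train, val, test = split_80_10_10(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```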
Eval results (on test split)
metric | neutral | positive | negative | macro avg | weighted avg |
---|---|---|---|---|---|
precision | 0.72 | 0.85 | 0.75 | 0.77 | 0.77 |
recall | 0.75 | 0.84 | 0.72 | 0.77 | 0.77 |
f1-score | 0.73 | 0.84 | 0.73 | 0.77 | 0.77 |
auc-roc | 0.86 | 0.96 | 0.92 | 0.91 | 0.91 |
support | 5196 | 3831 | 3599 | 12626 | 12626 |
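As a sanity check on the table, the macro average is the unweighted mean of the per-class values, and the weighted average weights each class by its support. The sketch below recomputes both for the precision row. It works from the rounded per-class numbers in the table, so it can only reproduce the averages up to rounding; `macro_avg` and `weighted_avg` are hypothetical helper names.

```python
# Per-class values copied from the table above.
support = {"neutral": 5196, "positive": 3831, "negative": 3599}
precision = {"neutral": 0.72, "positive": 0.85, "negative": 0.75}


def macro_avg(per_class):
    """Unweighted mean over classes."""
    return sum(per_class.values()) / len(per_class)


def weighted_avg(per_class, support):
    """Mean over classes weighted by the number of test examples per class."""
    total = sum(support.values())
    return sum(per_class[name] * support[name] for name in per_class) / total


print(sum(support.values()))                       # 12626
print(round(macro_avg(precision), 2))              # 0.77
print(round(weighted_avg(precision, support), 2))  # 0.77
```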