mikhmanoff committed
Commit bb98109
1 Parent(s): 5329e67

Upload 9 files
README.md CHANGED
@@ -1,3 +1,94 @@
- ---
- license: mit
- ---
---
license: mit
language:
- ru
metrics:
- f1
- roc_auc
- precision
- recall
pipeline_tag: text-classification
tags:
- sentiment-analysis
- multi-class-classification
- sentiment analysis
- rubert
- sentiment
- bert
- tiny
- russian
- multiclass
- classification
datasets:
- sismetanin/rureviews
- RuSentiment
- LinisCrowd2015
- LinisCrowd2016
- KaggleRussianNews
---

This is the [RuBERT-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) model fine-tuned for __sentiment classification__ of short __Russian__ texts.
The task is __multi-class classification__ with the following labels:

```yaml
0: neutral
1: positive
2: negative
```

Mapping of each label to its Russian equivalent:

```yaml
neutral: нейтральный
positive: позитивный
negative: негативный
```
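
For reference, the same mapping expressed in Python (an illustrative snippet only; the names below are not shipped with the model):

```python
# Mirrors the label ids and Russian names listed above.
ID2LABEL = {0: "neutral", 1: "positive", 2: "negative"}
LABEL_RU = {"neutral": "нейтральный", "positive": "позитивный", "negative": "негативный"}

def russian_label(label_id: int) -> str:
    """Return the Russian name for a numeric class id, e.g. 1 -> 'позитивный'."""
    return LABEL_RU[ID2LABEL[label_id]]
```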

## Usage

```python
from transformers import pipeline
model = pipeline(model="seara/rubert-tiny2-russian-sentiment")
model("Привет, ты мне нравишься!")
# [{'label': 'positive', 'score': 0.9398769736289978}]
```
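
If you need the raw class probabilities instead of a single label, the model can also be called without the `pipeline` wrapper. A minimal sketch (assuming `torch` and `transformers` are installed; labels are read from the `id2label` mapping shipped with the model):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "seara/rubert-tiny2-russian-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize, truncating to the maximum length used during training.
inputs = tokenizer("Привет, ты мне нравишься!", return_tensors="pt",
                   truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the three classes and print a probability per label.
probs = torch.softmax(logits, dim=-1)[0]
for label_id, p in enumerate(probs.tolist()):
    print(model.config.id2label[label_id], round(p, 4))
```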

## Dataset

This model was trained on the union of the following datasets:

- Kaggle Russian News Dataset
- Linis Crowd 2015
- Linis Crowd 2016
- RuReviews
- RuSentiment

An overview of the training data can be found in the [S. Smetanin GitHub repository](https://github.com/sismetanin/sentiment-analysis-in-russian).

__Download links for all Russian sentiment datasets collected by Smetanin can be found in this [repository](https://github.com/searayeah/russian-sentiment-emotion-datasets).__

## Training

Training was done in this [project](https://github.com/searayeah/bert-russian-sentiment-emotion) with the following parameters:

```yaml
tokenizer.max_length: 512
batch_size: 64
optimizer: adam
lr: 0.00001
weight_decay: 0
epochs: 5
```

The train/validation/test split is 80%/10%/10%.
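
The exact training code lives in the project linked above; as a rough illustration only, these hyperparameters map onto `transformers.Trainer` roughly as follows (the dataset variables are placeholders for the tokenized splits, and the original project may use a different training loop):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
model = AutoModelForSequenceClassification.from_pretrained(
    "cointegrated/rubert-tiny2", num_labels=3
)

# Hyperparameters from the YAML block above; the Trainer's default AdamW
# optimizer plays the role of "adam".
args = TrainingArguments(
    output_dir="rubert-tiny2-russian-sentiment",
    learning_rate=1e-5,
    per_device_train_batch_size=64,
    num_train_epochs=5,
    weight_decay=0.0,
)

# train_dataset / eval_dataset are placeholders: the 80%/10% splits,
# tokenized with `tokenizer` at max_length=512.
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
```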

## Eval results (on test split)

|         |neutral|positive|negative|macro avg|weighted avg|
|---------|-------|--------|--------|---------|------------|
|precision|0.7    |0.84    |0.74    |0.76     |0.75        |
|recall   |0.74   |0.83    |0.69    |0.75     |0.75        |
|f1-score |0.72   |0.83    |0.71    |0.75     |0.75        |
|auc-roc  |0.85   |0.95    |0.91    |0.9      |0.9         |
|support  |5196   |3831    |3599    |12626    |12626       |
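
Metrics of this kind can be computed with scikit-learn. A hedged sketch, where `y_true`, `y_pred` and `y_proba` are placeholders for the test-split gold labels, predicted labels and predicted class probabilities:

```python
from sklearn.metrics import classification_report, roc_auc_score

label_names = ["neutral", "positive", "negative"]

# Per-class and averaged precision / recall / f1-score / support.
print(classification_report(y_true, y_pred, target_names=label_names, digits=2))

# One-vs-rest AUC-ROC from class probabilities of shape (n_samples, 3).
print(roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro"))
```
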
config.json ADDED
@@ -0,0 +1,38 @@
{
  "_name_or_path": "cointegrated/rubert-tiny2",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "emb_size": 312,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 312,
  "id2label": {
    "0": "neutral",
    "1": "positive",
    "2": "negative"
  },
  "initializer_range": 0.02,
  "intermediate_size": 600,
  "label2id": {
    "negative": 2,
    "neutral": 0,
    "positive": 1
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 2048,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 3,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.29.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 83828
}
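
The `id2label` / `label2id` entries above are what the `pipeline` uses to turn class indices into label strings; they can be inspected without downloading the weights. A small illustrative snippet:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("seara/rubert-tiny2-russian-sentiment")
print(config.id2label)   # class index -> label name, as defined in config.json
print(config.label2id)   # label name -> class index
```
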
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:989424a83d6816d853571fd48926906f723c28d65b962d31ceca6029781c737e
size 116801868
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f6cd179bc20f33ddf17fd16318da2257574f831363d46d6a77a21faafb2f2c0f
size 116813471
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,15 @@
{
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": false,
  "mask_token": "[MASK]",
  "model_max_length": 2048,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
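
The tokenizer is a standard `BertTokenizer` (WordPiece, no lower-casing). Loading it and encoding a short Russian sentence might look like this sketch; truncation at 512 tokens matches the `tokenizer.max_length` used for training:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("seara/rubert-tiny2-russian-sentiment")

encoded = tokenizer("Привет, ты мне нравишься!", truncation=True, max_length=512)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # word-piece tokens
print(encoded["attention_mask"])                              # 1 for every real token
```
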
vocab.txt ADDED
The diff for this file is too large to render. See raw diff