lxyuan committed on
Commit
5b52a68
•
1 Parent(s): d8ed716

Create README.md

  1. README.md +144 -0
README.md ADDED
@@ -0,0 +1,144 @@
---
license: apache-2.0
tags:
- sentiment-analysis
- text-classification
- zero-shot-distillation
- distillation
- zero-shot-classification
- deberta-v3
model-index:
- name: distilbert-base-multilingual-cased-sentiments-student
  results: []
datasets:
- tyqiangz/multilingual-sentiments
language:
- en
- ar
- de
- es
- fr
- ja
- zh
- id
- hi
- it
- ms
- pt
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# distilbert-base-multilingual-cased-sentiments-student

This model was distilled from a zero-shot classification pipeline applied to the multilingual-sentiments dataset
(`tyqiangz/multilingual-sentiments`), using this [script](https://github.com/huggingface/transformers/tree/main/examples/research_projects/zero-shot-distillation).

The dataset is in fact annotated, but the annotations are ignored here for the sake of the example:
the student learns only from the teacher's zero-shot predictions.

- Teacher model: `MoritzLaurer/mDeBERTa-v3-base-mnli-xnli`
- Teacher hypothesis template: `"The sentiment of this text is {}."`
- Student model: `distilbert-base-multilingual-cased`

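
To illustrate the hypothesis template: a zero-shot NLI pipeline fills the `{}` placeholder with each candidate label and scores the resulting hypothesis against the input text. The sketch below shows only that string expansion. The label list is an assumption inferred from the labels in this card's example outputs, and `build_hypotheses` is a hypothetical helper, not part of Transformers.

```python
HYPOTHESIS_TEMPLATE = "The sentiment of this text is {}."

# Assumed label set, inferred from the example outputs in this card.
CLASS_NAMES = ["positive", "neutral", "negative"]


def build_hypotheses(template, class_names):
    """Return one NLI hypothesis per candidate label."""
    return [template.format(name) for name in class_names]


hypotheses = build_hypotheses(HYPOTHESIS_TEMPLATE, CLASS_NAMES)
for hypothesis in hypotheses:
    print(hypothesis)
```
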
## Inference example

```python
from transformers import pipeline

# Note: `return_all_scores` is deprecated in recent Transformers releases
# in favor of `top_k=None`.
distilled_student_sentiment_classifier = pipeline(
    model="lxyuan/distilbert-base-multilingual-cased-sentiments-student",
    return_all_scores=True,
)

# English
distilled_student_sentiment_classifier("I love this movie and i would watch it again and again!")
# >> [[{'label': 'positive', 'score': 0.9731044769287109},
#      {'label': 'neutral', 'score': 0.016910076141357422},
#      {'label': 'negative', 'score': 0.009985478594899178}]]

# Malay
distilled_student_sentiment_classifier("Saya suka filem ini dan saya akan menontonnya lagi dan lagi!")
# >> [[{'label': 'positive', 'score': 0.9760093688964844},
#      {'label': 'neutral', 'score': 0.01804516464471817},
#      {'label': 'negative', 'score': 0.005945465061813593}]]

# Japanese
distilled_student_sentiment_classifier("私はこの映画が大好きで、何度も見ます！")
# >> [[{'label': 'positive', 'score': 0.9342429041862488},
#      {'label': 'neutral', 'score': 0.040193185210227966},
#      {'label': 'negative', 'score': 0.025563929229974747}]]
```

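
When only the predicted class is needed, the nested output can be reduced to its highest-scoring entry. The following is a minimal sketch that works on the output structure shown in the English example (one inner list per input text) and needs no model download. `top_label` is a hypothetical helper name.

```python
def top_label(scores):
    """Return (label, score) of the highest-scoring class for one input."""
    best = max(scores, key=lambda item: item["score"])
    return best["label"], best["score"]


# Output copied from the English example (one inner list per input text).
outputs = [[
    {"label": "positive", "score": 0.9731044769287109},
    {"label": "neutral", "score": 0.016910076141357422},
    {"label": "negative", "score": 0.009985478594899178},
]]

for scores in outputs:
    label, score = top_label(scores)
    print(f"{label} ({score:.3f})")
```
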
## Training procedure

Notebook link: TBU

### Training hyperparameters

The results can be reproduced with the following command:

```bash
python transformers/examples/research_projects/zero-shot-distillation/distill_classifier.py \
    --data_file ./multilingual-sentiments/train_unlabeled.txt \
    --class_names_file ./multilingual-sentiments/class_names.txt \
    --hypothesis_template "The sentiment of this text is {}." \
    --teacher_name_or_path MoritzLaurer/mDeBERTa-v3-base-mnli-xnli \
    --teacher_batch_size 32 \
    --student_name_or_path distilbert-base-multilingual-cased \
    --output_dir ./distilbert-base-multilingual-cased-sentiments-student \
    --per_device_train_batch_size 16 \
    --fp16
```

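
The command reads two plain-text inputs. As a rough sketch, they could be prepared as follows, assuming the script expects one unlabeled example per line in the data file and one class name per line in the class names file (check the script's own README for the exact format). `write_distillation_inputs`, the sample texts, and the order of the class names are illustrative assumptions, not taken from the original training run.

```python
import tempfile
from pathlib import Path


def write_distillation_inputs(texts, class_names, out_dir):
    """Write train_unlabeled.txt and class_names.txt, one entry per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Collapse internal whitespace and newlines so each example stays on one line,
    # and drop examples that end up empty.
    lines = [" ".join(text.split()) for text in texts]
    lines = [line for line in lines if line]

    data_file = out_dir / "train_unlabeled.txt"
    class_file = out_dir / "class_names.txt"
    data_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    class_file.write_text("\n".join(class_names) + "\n", encoding="utf-8")
    return data_file, class_file


# Illustrative usage with made-up texts, written to a temporary directory.
example_dir = Path(tempfile.mkdtemp()) / "multilingual-sentiments"
data_file, class_file = write_distillation_inputs(
    ["I love this movie!", "Saya suka\nfilem ini.", "   "],
    ["positive", "neutral", "negative"],
    example_dir,
)
print(data_file.read_text(encoding="utf-8"))
```
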
If you train this model on Colab, make the following changes to the script to avoid out-of-memory errors:

```python
# Modify L78 to disable the fast tokenizer
default=False,

# Update the dataset map call at L313
dataset = dataset.map(tokenizer, input_columns="text", fn_kwargs={"padding": "max_length", "truncation": True, "max_length": 512})

# Add the following lines at L213
del model
print("Manually deleted teacher model to free some memory for the student model.")

# Add the following lines at L337 (pushes the model and tokenizer to the Hub)
trainer.push_to_hub()
tokenizer.push_to_hub("distilbert-base-multilingual-cased-sentiments-student")
```

### Training log

```bash
Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 2009.8864, 'train_samples_per_second': 73.0, 'train_steps_per_second': 4.563, 'train_loss': 0.6473459283913797, 'epoch': 1.0}
100%|███████████████████████████████████████| 9171/9171 [33:29<00:00, 4.56it/s]
[INFO|trainer.py:762] 2023-05-06 10:56:18,555 >> The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:3129] 2023-05-06 10:56:18,557 >> ***** Running Evaluation *****
[INFO|trainer.py:3131] 2023-05-06 10:56:18,557 >> Num examples = 146721
[INFO|trainer.py:3134] 2023-05-06 10:56:18,557 >> Batch size = 128
100%|███████████████████████████████████████| 1147/1147 [08:59<00:00, 2.13it/s]
05/06/2023 11:05:18 - INFO - __main__ - Agreement of student and teacher predictions: 88.29%
[INFO|trainer.py:2868] 2023-05-06 11:05:18,251 >> Saving model checkpoint to ./distilbert-base-multilingual-cased-sentiments-student
[INFO|configuration_utils.py:457] 2023-05-06 11:05:18,251 >> Configuration saved in ./distilbert-base-multilingual-cased-sentiments-student/config.json
[INFO|modeling_utils.py:1847] 2023-05-06 11:05:18,905 >> Model weights saved in ./distilbert-base-multilingual-cased-sentiments-student/pytorch_model.bin
[INFO|tokenization_utils_base.py:2171] 2023-05-06 11:05:18,905 >> tokenizer config file saved in ./distilbert-base-multilingual-cased-sentiments-student/tokenizer_config.json
[INFO|tokenization_utils_base.py:2178] 2023-05-06 11:05:18,905 >> Special tokens file saved in ./distilbert-base-multilingual-cased-sentiments-student/special_tokens_map.json
```

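
The agreement figure in the log can be read as the share of examples on which the student and the teacher pick the same top class. The sketch below computes that on made-up toy distributions, then cross-checks a few numbers from the log. It assumes a single training device with no gradient accumulation, and that the training set has the same 146721 examples reported for evaluation. `agreement` is a hypothetical helper, not the script's own function.

```python
import math


def agreement(student_scores, teacher_scores):
    """Fraction of examples where both models pick the same top class."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    matches = sum(
        argmax(s) == argmax(t) for s, t in zip(student_scores, teacher_scores)
    )
    return matches / len(student_scores)


# Toy class distributions (made up for illustration): 3 of 4 top classes match.
student = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.3, 0.5], [0.5, 0.4, 0.1]]
teacher = [[0.9, 0.05, 0.05], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.3, 0.6, 0.1]]
print(f"agreement: {agreement(student, teacher):.2%}")

# Cross-check of the log figures under the assumptions stated above.
num_examples = 146721                               # "Num examples" in the log
train_steps = math.ceil(num_examples / 16)          # per_device_train_batch_size 16
eval_steps = math.ceil(num_examples / 128)          # evaluation batch size 128
samples_per_second = num_examples / 2009.8864       # train_runtime in seconds
print(train_steps, eval_steps, round(samples_per_second, 1))
```

Under these assumptions the step counts come out to 9171 and 1147, matching the two progress bars, and the throughput rounds to the logged 73.0 samples per second.
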
### Framework versions

- Transformers 4.28.1
- PyTorch 2.0.0+cu118
- Datasets 2.11.0
- Tokenizers 0.13.3