philipp-zettl committed on
Commit
6cc2b88
1 Parent(s): 62e2378

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +162 -14
README.md CHANGED
@@ -1,9 +1,22 @@
  ---
- library_name: transformers
  tags: []
  ---

- # Model Card for Model ID

  <!-- Provide a quick summary of what the model is/does. -->
@@ -15,21 +28,38 @@ tags: []
  <!-- Provide a longer summary of what this model is. -->

- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- - **Developed by:** [More Information Needed]
  - **Funded by [optional]:** [More Information Needed]
  - **Shared by [optional]:** [More Information Needed]
  - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]

  ### Model Sources [optional]

  <!-- Provide the basic links for the model. -->

- - **Repository:** [More Information Needed]
  - **Paper [optional]:** [More Information Needed]
  - **Demo [optional]:** [More Information Needed]
@@ -41,7 +71,7 @@ This is the model card of a 🤗 transformers model that has been pushed on the
  <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

- [More Information Needed]

  ### Downstream Use [optional]
@@ -53,7 +83,11 @@ This is the model card of a 🤗 transformers model that has been pushed on the
  <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

- [More Information Needed]

  ## Bias, Risks, and Limitations
@@ -71,7 +105,28 @@ Users (both direct and downstream) should be made aware of the risks, biases and
  Use the code below to get started with the model.

- [More Information Needed]

  ## Training Details
@@ -79,7 +134,11 @@ Use the code below to get started with the model.
  <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- [More Information Needed]

  ### Training Procedure
@@ -87,7 +146,24 @@ Use the code below to get started with the model.
  #### Preprocessing [optional]

- [More Information Needed]

  #### Training Hyperparameters
@@ -136,7 +212,79 @@ Use the code below to get started with the model.
  <!-- Relevant interpretability work for the model goes here -->

- [More Information Needed]

  ## Environmental Impact
 
  ---
+ language: multilingual
+ license: mit
+ library_name: torch
  tags: []
+ base_model: BAAI/bge-m3
+ datasets: philipp-zettl/GGU-xx
+ metrics:
+ - accuracy
+ - f1
+ - recall
+ model_name: GGU-CLF
+ pipeline_tag: text-classification
+ widget:
+ - name: test1
+   text: hello world
  ---

+ # Model Card for GGU-CLF

  <!-- Provide a quick summary of what the model is/does. -->
 
  <!-- Provide a longer summary of what this model is. -->

+ This is a simple classification model trained on a custom dataset.
+
+ Please note that although this model is implemented with the `transformers` library, it is not a usual transformer model.
+ It combines the underlying embedding model and the required tokenizer into a simple-to-use pipeline for sequence classification.
+
+ It is used to classify user text into the following classes:
+ - 0: Greeting
+ - 1: Gratitude
+ - 2: Unknown
+
+ **Note**: To use this model, please keep the following in mind:
+
+ 1. The model is an XLMRoberta model based on [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3).
+ 2. The required tokenizer is baked into the classifier implementation.
+
+ - **Developed by:** [philipp-zettl](https://huggingface.co/philipp-zettl/)
  - **Funded by [optional]:** [More Information Needed]
  - **Shared by [optional]:** [More Information Needed]
  - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** multilingual
+ - **License:** mit
+ - **Finetuned from model [optional]:** BAAI/bge-m3

  ### Model Sources [optional]

  <!-- Provide the basic links for the model. -->

+ - **Repository:** [philipp-zettl/GGU-CLF](https://huggingface.co/philipp-zettl/GGU-CLF)
  - **Paper [optional]:** [More Information Needed]
  - **Demo [optional]:** [More Information Needed]

  <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

+ Use this model to classify messages from natural language chats.

  ### Downstream Use [optional]

 
  <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

+ The model was not trained on multi-sentence samples. **You should avoid those.**
+
+ Officially tested and supported languages are **English and German**; any other language is considered out of scope.

  ## Bias, Risks, and Limitations

  Use the code below to get started with the model.

+ ```python
+ import torch
+ from transformers import AutoModel
+
+ # The custom architecture may require trust_remote_code=True when loading.
+ model = AutoModel.from_pretrained("philipp-zettl/GGU-xx").to(torch.float16).to('cuda')
+
+ # The tokenizer is baked into the model, so raw strings are passed in directly.
+ model([
+     'Hi wie gehts?',
+     'Dannke dir mein freund!',
+     'Merci freundchen, send mir mal ein paar Machine Learning jobs.',
+     'Works as expected, cheers!',
+     'How you doin my boy',
+     'send me immediately some matching jobs, thanks',
+     "wer's eigentlich tom selleck?",
+     'sprichst du deutsch?',
+     'sprechen sie deutsch sie hurensohn?',
+     'vergeltsgott',
+     'heidenei dank dir recht herzlich',
+     'grazie mille bambino, come estas'
+ ]).argmax(dim=1)
+ ```
+
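+ The call above returns one class index per input string. Below is a minimal sketch for mapping those indices back to label names; the mapping is taken from the class list in this card, not read from the model config:
+
+ ```python
+ # Assumed from the class list above: 0 = Greeting, 1 = Gratitude, 2 = Unknown.
+ id2label = {0: 'Greeting', 1: 'Gratitude', 2: 'Unknown'}
+
+ preds = model(['Hi wie gehts?', 'Danke dir!']).argmax(dim=1)
+ print([id2label[int(i)] for i in preds])
+ ```
+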
  ## Training Details

  <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

+ This model was trained using the [philipp-zettl/GGU-xx](https://huggingface.co/datasets/philipp-zettl/GGU-xx) dataset.
+
+ You can find its performance metrics under [Evaluation Results](#evaluation-results).
+
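+ The dataset exposes a `sample` field (the text) and a `label` field (the class index); those names are taken from the Preprocessing snippet below. A quick way to inspect it:
+
+ ```python
+ from datasets import load_dataset
+
+ ds = load_dataset('philipp-zettl/GGU-xx')
+ print(ds['train'][0])  # expected to look like {'sample': '...', 'label': 0}
+ ```
+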
 
  ### Training Procedure

  #### Preprocessing [optional]

+ The following code was used to build the dataset and split it into training and validation sets.
+
+ ```python
+ from datasets import load_dataset
+ from sklearn.model_selection import train_test_split
+
+
+ class Dataset:
+     def __init__(self, dataset, target_names=None):
+         self.data = list(map(lambda x: x[0], dataset))
+         self.target = list(map(lambda x: x[1], dataset))
+         self.target_names = target_names
+
+
+ ds = load_dataset('philipp-zettl/GGU-xx')
+ data = Dataset([[e['sample'], e['label']] for e in ds['train']], ['greeting', 'gratitude', 'unknown'])
+ X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
+ ```
+

  #### Training Hyperparameters
 
  <!-- Relevant interpretability work for the model goes here -->

+ You can find the initial implementation of the classification model below:
+
+ ```python
+ from transformers import PreTrainedModel, PretrainedConfig, AutoModel, AutoTokenizer
+ import torch
+ import torch.nn as nn
+
+
+ class EmbeddingClassifierConfig(PretrainedConfig):
+     model_type = 'xlm-roberta'
+
+     def __init__(self, num_classes=3, base_model='BAAI/bge-m3', tokenizer='BAAI/bge-m3', dropout=0.0, l2_reg=0.01, torch_dtype=torch.float16, **kwargs):
+         self.num_classes = num_classes
+         self.base_model = base_model
+         self.tokenizer = tokenizer
+         self.dropout = dropout
+         self.l2_reg = l2_reg
+         self.torch_dtype = torch_dtype
+         super().__init__(**kwargs)
+
+
+ class EmbeddingClassifier(PreTrainedModel):
+     config_class = EmbeddingClassifierConfig
+
+     def __init__(self, config):
+         super().__init__(config)
+         base_model = config.base_model
+         tokenizer = config.tokenizer
+
+         # Resolve the base model and tokenizer when they are given as names.
+         if base_model is None or isinstance(base_model, str):
+             base_model = AutoModel.from_pretrained(base_model)
+         if tokenizer is None or isinstance(tokenizer, str):
+             tokenizer = AutoTokenizer.from_pretrained(tokenizer)
+
+         self.tokenizer = tokenizer
+         self.base = base_model
+         self.fc = nn.Linear(base_model.config.hidden_size, config.num_classes)
+         self.do = nn.Dropout(config.dropout)
+         self.l2_reg = config.l2_reg
+
+         self.to(config.torch_dtype)
+
+     def forward(self, X):
+         # The tokenizer is part of the model, so X is a list of raw strings.
+         encoding = self.tokenizer(
+             X, return_tensors='pt',
+             padding=True, truncation=True
+         ).to(self.device)
+         input_ids = encoding['input_ids']
+         attention_mask = encoding['attention_mask']
+         # Use the first-token ([CLS]) embedding as the sentence representation.
+         emb = self.base(
+             input_ids,
+             attention_mask=attention_mask,
+             return_dict=True,
+             output_hidden_states=True
+         ).last_hidden_state[:, 0, :]
+         return self.fc(self.do(emb))
+
+     def train(self, set_val=True):
+         # Keep the embedding backbone frozen; only the classification head is trainable.
+         self.base.train(False)
+         for param in self.base.parameters():
+             param.requires_grad = False
+         for param in self.fc.parameters():
+             param.requires_grad = set_val
+
+     def get_l2_loss(self):
+         # L2 penalty over the trainable parameters (i.e. the classification head).
+         l2_loss = torch.tensor(0.).to(self.device)
+         for param in self.parameters():
+             if param.requires_grad:
+                 l2_loss += torch.norm(param, 2)
+         return self.l2_reg * l2_loss
+ ```
+
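+ With this setup the embedding backbone stays frozen and only the linear head is trained. Below is a minimal training-loop sketch under that assumption, reusing `X_train`/`y_train` from the Preprocessing section; the optimizer, learning rate, batch size, epoch count and float32 dtype are illustrative choices, not the card's actual training configuration:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # float32 keeps the sketch device-agnostic; the config defaults to float16.
+ model = EmbeddingClassifier(EmbeddingClassifierConfig(torch_dtype=torch.float32))
+ model.train()  # freezes the base model, marks the classification head as trainable
+
+ optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
+
+ batch_size = 16
+ for epoch in range(3):
+     for i in range(0, len(X_train), batch_size):
+         batch_x = X_train[i:i + batch_size]
+         batch_y = torch.tensor(y_train[i:i + batch_size], device=model.device)
+         logits = model(batch_x)
+         # Cross-entropy plus the model's own L2 penalty on the trainable parameters.
+         loss = F.cross_entropy(logits, batch_y) + model.get_l2_loss()
+         optimizer.zero_grad()
+         loss.backward()
+         optimizer.step()
+ ```
+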
 
  ## Environmental Impact