---
language: en
tags:
- text-classification
- onnx
- int8
- roberta
- emotions
- multi-class-classification
- multi-label-classification
- optimum
datasets:
- go_emotions
license: mit
inference: false
widget:
- text: Thank goodness ONNX is available, it is lots faster!
---

This model is the ONNX version of [https://huggingface.co/SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions).

### Full precision ONNX version

`onnx/model.onnx` is the full precision ONNX version (a short loading sketch follows the list below):

- it has identical accuracy/metrics to the original Transformers model
- it is the same size as the original model (499MB)
- it is faster in inference than the normal Transformers model, particularly for smaller batch sizes
  - in my tests, about 2x to 3x as fast for a batch size of 1 on an 8 core 11th gen i7 CPU using ONNXRuntime
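
For illustration (a minimal sketch, using the Optimum classes covered in more detail under "How to use" below), this file can be selected by passing its path as `file_name`:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification

# loads the full precision graph, onnx/model.onnx, from this repo
model = ORTModelForSequenceClassification.from_pretrained(
    "SamLowe/roberta-base-go_emotions-onnx", file_name="onnx/model.onnx"
)
```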

#### Metrics

Using a fixed threshold of 0.5 to convert the scores to binary predictions for each label:

- Accuracy: 0.474
- Precision: 0.575
- Recall: 0.396
- F1: 0.450

See more details in the SamLowe/roberta-base-go_emotions model card for the increases possible through selecting label-specific thresholds to maximise F1 scores, or another metric.
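
For illustration only (this is a sketch, not the exact evaluation script behind the figures above; the dummy arrays and the choice of macro averaging are assumptions), metrics of this kind can be computed with scikit-learn:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 28))  # dummy ground truth for the 28 go_emotions labels
y_scores = rng.random(size=(100, 28))        # dummy sigmoid scores from the model

y_pred = (y_scores >= 0.5).astype(int)  # fixed threshold of 0.5 per label

print("Accuracy: ", accuracy_score(y_true, y_pred))  # exact-match (subset) accuracy
print("Precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("Recall:   ", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F1:       ", f1_score(y_true, y_pred, average="macro", zero_division=0))
```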

### Quantized (INT8) ONNX version

`onnx/model_quantized.onnx` is the int8 quantized version (a quantization sketch follows the list below):

- it is one quarter the size (125MB) of the full precision model (above)
- but delivers almost all of the accuracy
- it is faster in inference than both the full precision ONNX above and the normal Transformers model
  - about 2x as fast for a batch size of 1 on an 8 core 11th gen i7 CPU using ONNXRuntime vs the full precision model above
  - which makes it circa 5x as fast as the full precision normal Transformers model (on the above-mentioned CPU, for a batch of 1)
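
For reference, a file like this can be produced with Optimum's `ORTQuantizer`; the sketch below uses dynamic (weight-only) int8 quantization, and the exact configuration used for this repo's file is an assumption:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# export the original Transformers model to ONNX, then quantize it
model = ORTModelForSequenceClassification.from_pretrained(
    "SamLowe/roberta-base-go_emotions", export=True
)
quantizer = ORTQuantizer.from_pretrained(model)

# dynamic int8 quantization; the avx512_vnni config here is illustrative
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="./quantized", quantization_config=qconfig)
```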
 
#### Metrics for Quantized (INT8) Model

Using a fixed threshold of 0.5 to convert the scores to binary predictions for each label:

- Accuracy: 0.475
- Precision: 0.582
- Recall: 0.398
- F1: 0.447

Note how the metrics are almost identical to the full precision metrics above.

See more details in the SamLowe/roberta-base-go_emotions model card for the increases possible through selecting label-specific thresholds to maximise F1 scores, or another metric.

### How to use

#### Using Optimum Library ONNX Classes

The Optimum library provides equivalents (prefixed `ORT`) of the main Transformers classes, so these models can be used with the familiar constructs. The only extra argument needed is `file_name` when creating the model, which in the example below selects the quantized (INT8) model.

```python
sentences = ["ONNX is seriously fast for small batches. Impressive"]

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "SamLowe/roberta-base-go_emotions-onnx"
file_name = "onnx/model_quantized.onnx"

model = ORTModelForSequenceClassification.from_pretrained(model_id, file_name=file_name)
tokenizer = AutoTokenizer.from_pretrained(model_id)

onnx_classifier = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,
    function_to_apply="sigmoid",  # optional as is the default for the task
)

model_outputs = onnx_classifier(sentences)
# gives a list of outputs, each a list of dicts (one per label)

print(model_outputs)
# E.g.
# [[{'label': 'admiration', 'score': 0.9203393459320068},
#   {'label': 'approval', 'score': 0.0560273639857769},
#   {'label': 'neutral', 'score': 0.04265536740422249},
#   {'label': 'gratitude', 'score': 0.015126707963645458},
# ...
```
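
If binary predictions per label are wanted (as used for the metrics above), the scores can simply be compared against the fixed 0.5 threshold, e.g.:

```python
threshold = 0.5
for result in model_outputs:
    predicted_labels = [d["label"] for d in result if d["score"] >= threshold]
    print(predicted_labels)  # e.g. ['admiration'] for the example sentence above
```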

#### Using ONNXRuntime

- Tokenization can be done beforehand with the `tokenizers` library,
- the token ids and attention mask are then fed into ONNXRuntime as the dict of inputs it expects,
- and a sigmoid postprocessing step is applied to the model output (which comes back as a numpy array of logits) to produce the score per label.

```python
from tokenizers import Tokenizer
import onnxruntime as ort

from os import cpu_count
import numpy as np  # used for the input arrays and the postprocessing sigmoid

sentences = ["hello world"]  # for example a batch of 1

# labels as (ordered) list - from the go_emotions dataset
labels = ['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']

tokenizer = Tokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")

# Optional - set pad to only pad to longest in batch, not a fixed length.
# (without this, the model will run slower, esp for shorter input strings)
params = {**tokenizer.padding, "length": None}
tokenizer.enable_padding(**params)

tokens_obj = tokenizer.encode_batch(sentences)

def load_onnx_model(model_filepath):
    _options = ort.SessionOptions()
    _options.inter_op_num_threads, _options.intra_op_num_threads = cpu_count(), cpu_count()
    _providers = ["CPUExecutionProvider"]  # could use ort.get_available_providers()
    return ort.InferenceSession(path_or_bytes=model_filepath, sess_options=_options, providers=_providers)

model = load_onnx_model("path_to_model_dot_onnx_or_model_quantized_dot_onnx")
output_names = [model.get_outputs()[0].name]  # E.g. ["logits"]

# ONNXRuntime expects numpy arrays; int64 matches the exported model's input dtype
input_feed_dict = {
    "input_ids": np.array([t.ids for t in tokens_obj], dtype=np.int64),
    "attention_mask": np.array([t.attention_mask for t in tokens_obj], dtype=np.int64),
}

logits = model.run(output_names=output_names, input_feed=input_feed_dict)[0]
# produces a numpy array, one row per input item, one col per label

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Post-processing: converts the logits to scores per label in the range [0, 1].
# Done automatically by Transformers' pipeline, but must be done manually with ORT.
model_outputs = sigmoid(logits)

# for example, just to show the top result per input item
for probas in model_outputs:
    top_result_index = np.argmax(probas)
    print(labels[top_result_index], "with score:", probas[top_result_index])
```
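
Note the session options above give both the inter-op and intra-op thread pools one thread per CPU core; the best values are workload dependent, so it is worth benchmarking these (and the dynamic padding setting above) against your own batch sizes and input lengths.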

### Example notebook: showing usage, accuracy & performance

Notebook with more details to follow.