---
metrics:
- mse
- r_squared
- mae
datasets:
- google_wellformed_query
inference: false
model-index:
- name: distilroberta-query-wellformedness 
  results:
  - task:
      type: text-classification
      name: Text Classification
    metrics:
    - type: loss
      value: 0.061837393790483475
    - type: mse
      value: 0.061837393790483475
      name: Validation Mean Squared Error
    - type: r2
      value: 0.5726782083511353
      name: Validation R-Squared
    - type: mae
      value: 0.183049738407135
      name: Validation Mean Absolute Error
language:
- en
---
## DistilRoBERTa-query-wellformedness

This model is based on [DistilRoBERTa base](https://huggingface.co/distilroberta-base) and has been fine-tuned for a regression task on [Google's query wellformedness dataset](https://huggingface.co/datasets/google_wellformed_query), which comprises 25,100 queries from the Paralex corpus. Each query was annotated by five raters, who provided a continuous rating indicating the degree to which the query is well-formed.
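For reference, the dataset can be inspected with the `datasets` library. This is only a quick look-up sketch, assuming the dataset remains loadable under that name; check `dataset["train"].features` for the exact field names in your environment:

```python
from datasets import load_dataset

# Load Google's query wellformedness dataset from the Hugging Face Hub
dataset = load_dataset("google_wellformed_query")

# Each example pairs a query string with the mean rating from five annotators (0.0 to 1.0)
print(dataset)              # split sizes (train/validation/test)
print(dataset["train"][0])  # one example, e.g. a 'content' string and a 'rating' float
```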

## Model description

A regression head has been appended to the DistilRoBERTa model to adapt it for a regression task. This head is not part of the base checkpoint and must be loaded alongside it at inference time to obtain correct predictions. The model scores a query for completeness and grammatical correctness, producing a value between 0 and 1, where 1 indicates a well-formed query.

## Usage

The Inference API has been disabled because this is a regression task, not a text classification task, and Hugging Face does not provide a pipeline for regression.
Because of the training data, the model performs best on queries phrased as questions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("AdamCodd/distilroberta-query-wellformedness")

class RegressionModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = AutoModelForSequenceClassification.from_pretrained("AdamCodd/distilroberta-query-wellformedness")
        self.regression_head = torch.nn.Linear(self.model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, **kwargs):
        # Run the underlying RoBERTa encoder (without the classification head)
        outputs = self.model.base_model(input_ids=input_ids, attention_mask=attention_mask)
        # Score the [CLS] token representation and squash it into [0, 1]
        rating = self.regression_head(outputs.last_hidden_state[:, 0, :])
        rating = torch.sigmoid(rating)
        return rating.squeeze()

regression_model = RegressionModel()
# Do not forget to set the correct path to load the regression head
regression_model.regression_head.load_state_dict(torch.load("path_to_the_regression_head.pth"))
regression_model.eval()
# Examples
sentences = [
    "The cat and dog in the yard.",
    "she don't like apples.",
    "Is rain sunny days sometimes?",
    "She enjoys reading books and playing chess.",
    "How many planets are there in our solar system?"
]

inputs = tokenizer(sentences, truncation=True, padding=True, return_tensors='pt')

with torch.no_grad():
    outputs = regression_model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])

predictions = outputs.tolist()
for i, rating in enumerate(predictions):
    print(f'Sentence: {sentences[i]}')
    print(f'Predicted Rating: {rating}\n')
```
Output:
```
Sentence: The cat and dog in the yard.
Predicted Rating: 0.20430190861225128

Sentence: she don't like apples.
Predicted Rating: 0.08289700001478195

Sentence: Is rain sunny days sometimes?
Predicted Rating: 0.20011138916015625

Sentence: She enjoys reading books and playing chess.
Predicted Rating: 0.8915354013442993

Sentence: How many planets are there in our solar system?
Predicted Rating: 0.974799394607544
```
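If a binary well-formed/ill-formed decision is needed, the continuous score can simply be thresholded. The snippet below continues from the usage example above; the 0.5 cutoff is an arbitrary illustration, not a value recommended by the model author:

```python
# Hypothetical post-processing: threshold the continuous rating at 0.5
THRESHOLD = 0.5
for sentence, rating in zip(sentences, predictions):
    label = "well-formed" if rating >= THRESHOLD else "ill-formed"
    print(f"{sentence} -> {rating:.3f} ({label})")
```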

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (see the sketch after this list for how they map onto a standard optimizer and scheduler setup):
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 400
- num_epochs: 5
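
The exact training script is not published here; the snippet below is only a minimal sketch of how these hyperparameters correspond to a standard AdamW plus linear-warmup setup with `transformers`, not the author's actual training code (training was done with PyTorch Lightning, and the released model uses a separate regression head as shown in the Usage section):

```python
import torch
from transformers import AutoModel, get_linear_schedule_with_warmup

# Stand-in encoder for illustration; the real setup attaches a regression head on top
model = AutoModel.from_pretrained("distilroberta-base")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, betas=(0.9, 0.999), eps=1e-8)

# In a real loop, num_training_steps = len(train_dataloader) * num_epochs
num_training_steps = 1000  # placeholder value for illustration
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=400, num_training_steps=num_training_steps
)
```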

### Training results

Metrics: Mean Squared Error, R-Squared, Mean Absolute Error

```
'test_loss': 0.061837393790483475,
'test_mse': 0.061837393790483475,
'test_r2': 0.5726782083511353,
'test_mae': 0.183049738407135
```
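
As a reference point, metrics of this kind are typically computed from model predictions and gold ratings as sketched below (using scikit-learn here; the author's evaluation code may differ, and the numbers are placeholders):

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# preds/targets would come from running the model over the held-out split;
# the values below are placeholders for illustration only
preds = [0.20, 0.08, 0.89, 0.97]
targets = [0.25, 0.00, 1.00, 1.00]

print("mse:", mean_squared_error(targets, preds))
print("r2 :", r2_score(targets, preds))
print("mae:", mean_absolute_error(targets, preds))
```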

### Framework versions

- Transformers 4.34.1
- PyTorch Lightning 2.1.0
- Tokenizers 0.14.1

If you want to support me, you can do so [here](https://ko-fi.com/adamcodd).