---
license: apache-2.0
datasets:
- mediabiasgroup/BABE
language:
- en
pipeline_tag: text-classification
---

# Model Card for a DistilRoBERTa Bias Detection Model

<!-- Provide a quick summary of what the model is/does. -->

This model detects bias in text. It analyzes text inputs to identify and classify types of bias,
aiding the development of more inclusive and fair AI systems.
It was fine-tuned from the valurank/distilroberta-bias model for research purposes. Because the
training corpus consists of news titles, the model is best suited to detecting bias in formal language.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->
The data used for fine-tuning is the MBIC dataset, which contains texts annotated with bias labels.

The model classifies any input text as Biased or Non_biased. The maximum sequence length for the tokenizer is 512 tokens.
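
As an illustration, tokenization with that limit might look like the sketch below. Using the base model's tokenizer is an assumption here, since the card does not name the tokenizer explicitly:

```python
from transformers import AutoTokenizer

# Tokenizer of the base model; inputs beyond 512 tokens are truncated.
tokenizer = AutoTokenizer.from_pretrained("valurank/distilroberta-bias")
encoded = tokenizer(
    "A news title to screen for bias.",
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, 512])
```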




- **Developed by:** [More Information Needed]
- **Model type:** DistilRoBERTa transformer
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** valurank/distilroberta-bias
- **Repository:** ***To be uploaded***

### The following sections are under construction...



## How to Get Started with the Model

Use the code below to get started with the model.

***Link to the github demo page to be included***

[More Information Needed]
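
In the meantime, the snippet below is a minimal, illustrative sketch using the `transformers` pipeline API; the repository id `your-org/bias-detector` is a hypothetical placeholder until the model is uploaded.

```python
from transformers import pipeline

# Hypothetical repository id -- replace with the real one once the model is uploaded.
classifier = pipeline("text-classification", model="your-org/bias-detector")

# The model labels each input as Biased or Non_biased.
print(classifier("The senator's reckless scheme will ruin the economy."))
```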

## Training Details

- **Dataset size:** 1,700 entries
- **Preprocessing:** Tokenization with the pre-specified tokenizer, using padding and truncation to convert text into numerical features; class labels are encoded numerically.
- **Data split:** 80% training / 20% validation, with a fixed random state for reproducibility.
- **Optimizer:** AdamW
- **Loss function:** CrossEntropyLoss, weighted by class frequencies to address class imbalance.
- **Learning rate:** 1e-5
- **Epochs:** 3
- **Batch size:** 16
- **Regularization:** Gradient clipping with a max norm of 1.0.
- **Learning-rate schedule:** Step scheduler with step size 3 and gamma 0.1.
- **Training speed:** Around 150 iterations/s with CUDA-enabled PyTorch; training completes in under 10 minutes.
- **Monitoring:** Training loss, validation loss, and validation accuracy.
- **Validation data:** Drawn from the same DataFrame `df` via the train/test split above.
- **Fine-tuning techniques:** A learning-rate scheduler for adjusting the learning rate during training.

A sketch of a training loop matching these settings follows.
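
The sketch below is illustrative, not the actual training script; the toy texts and labels stand in for the real MBIC training split, and the hyperparameters follow the list above.

```python
import torch
from torch.nn.utils import clip_grad_norm_
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("valurank/distilroberta-bias")
model = AutoModelForSequenceClassification.from_pretrained(
    "valurank/distilroberta-bias", num_labels=2)

# Toy stand-ins for the real training split.
texts = ["Example biased headline", "Example neutral headline"]
labels = torch.tensor([0, 1])
enc = tokenizer(texts, padding=True, truncation=True, max_length=512,
                return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], labels),
                    batch_size=16, shuffle=True)

# Class weights inversely proportional to class frequency, as described above.
counts = torch.bincount(labels, minlength=2).float()
weights = counts.sum() / (2 * counts)
criterion = torch.nn.CrossEntropyLoss(weight=weights)

optimizer = AdamW(model.parameters(), lr=1e-5)
scheduler = StepLR(optimizer, step_size=3, gamma=0.1)

for epoch in range(3):
    model.train()
    for input_ids, attention_mask, y in loader:
        optimizer.zero_grad()
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        loss = criterion(logits, y)
        loss.backward()
        clip_grad_norm_(model.parameters(), max_norm=1.0)  # max norm 1.0
        optimizer.step()
    scheduler.step()  # decay LR with step size 3, gamma 0.1
```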

## Challenges and Solutions

**Challenges faced during training:** Class imbalance in the training data.

**Solutions and techniques applied:** Class weights computed from the training data drive a weighted CrossEntropyLoss, and gradient clipping keeps updates stable (see the training-loop sketch above).




#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

### Results

[More Information Needed]

#### Summary



### Model Update Log