---
license: apache-2.0
datasets:
- mediabiasgroup/BABE
language:
- en
pipeline_tag: text-classification
---
# Model Card: Bias Detection in Text
<!-- Provide a quick summary of what the model is/does. -->
This model detects bias in text data. It analyzes text inputs and classifies them as biased or non-biased, aiding the development of more inclusive and fair AI systems.
It was fine-tuned from the valurank/distilroberta-bias model for research purposes. Because the training corpus consists of news titles, the model is best suited to detecting bias in formal language.
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
The model was fine-tuned on the MBIC dataset, which contains texts with bias labels.
It classifies any text as `Biased` or `Non_biased`. The tokenizer's maximum sequence length is set to 512 tokens.
- **Developed by:** [More Information Needed]
- **Model type:** DistilRoBERTa transformer
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** valurank/distilroberta-bias
- **Repository:** ***To be uploaded***
### The following sections are under construction...
## How to Get Started with the Model
Use the code below to get started with the model.
***Link to the github demo page to be included***
[More Information Needed]
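Until the repository link is published, the sketch below shows one plausible way to query the model with the `transformers` library. The repository name `your-org/distilroberta-bias-babe` is a placeholder (the real checkpoint has not been uploaded yet), and the label mapping `0 = Non_biased, 1 = Biased` is an assumption that should be checked against the model's config once released.

```python
# Hypothetical usage sketch -- the checkpoint is not yet published, so the
# model name below is a placeholder, and the id-to-label mapping is assumed.
MODEL_NAME = "your-org/distilroberta-bias-babe"  # placeholder, not a real repo

ID2LABEL = {0: "Non_biased", 1: "Biased"}  # assumed mapping; verify in config

def classify(text: str) -> str:
    """Tokenize with the 512-token limit stated above and return a label."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    inputs = tokenizer(text, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return ID2LABEL[int(logits.argmax(dim=-1))]
```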
## Training Details
- **Dataset size:** 1,700 entries
- **Preprocessing:** Tokenization with the pre-specified tokenizer (padding and truncation) to convert text into numerical features; class labels are encoded numerically.
- **Data split:** 80% training / 20% validation, with a fixed random state for reproducibility. The validation set is held out from the same DataFrame `df` via this split.
- **Optimizer:** AdamW
- **Loss function:** CrossEntropyLoss, weighted by class frequencies to address class imbalance.
- **Learning rate:** 1e-5
- **Epochs:** 3
- **Batch size:** 16
- **Regularization:** Gradient clipping with a max norm of 1.0.
- **Learning-rate schedule:** Step scheduler with a step size of 3 and a gamma of 0.1 for learning rate decay.
- **Training time:** Roughly 150 iterations/s on a CUDA-enabled PyTorch setup; a full run takes under 10 minutes.
- **Monitoring:** Training loss, validation loss, and validation accuracy.
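The setup above can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the actual fine-tuning script: a tiny linear model and random tensors stand in for DistilRoBERTa and the tokenized news titles, but the optimizer, weighted loss, gradient clipping, scheduler, and hyperparameters match the list above.

```python
# Minimal training-loop sketch: a stand-in model and random data illustrate
# the configuration described above (AdamW, weighted CE loss, grad clipping,
# StepLR with step_size=3 / gamma=0.1, lr=1e-5, 3 epochs, batch size 16).
import torch
from torch import nn

torch.manual_seed(0)

# Stand-ins for tokenized features and binary bias labels.
X = torch.randn(64, 16)
y = torch.randint(0, 2, (64,))

model = nn.Linear(16, 2)  # stand-in for the sequence-classification head

# Class weights inversely proportional to class frequency (imbalance handling).
counts = torch.bincount(y, minlength=2).float()
weights = counts.sum() / (2 * counts)
criterion = nn.CrossEntropyLoss(weight=weights)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

for epoch in range(3):                    # 3 epochs
    for i in range(0, len(X), 16):        # batch size 16
        optimizer.zero_grad()
        loss = criterion(model(X[i:i + 16]), y[i:i + 16])
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()                      # decay LR once per epoch
```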
## Challenges and Solutions
**Challenges Faced During Training**: Class imbalance in the training data.
**Solutions and Techniques Applied**: Class weights are computed from the training data and passed to the CrossEntropyLoss; gradient clipping is applied to stabilize training.
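The class-weight calculation can be sketched as follows, assuming the common inverse-frequency scheme (the card does not specify the exact formula, so this is one plausible implementation): each class is weighted inversely to its frequency, so the minority class contributes more to the loss.

```python
# Sketch of an inverse-frequency class-weight calculation (assumed scheme;
# the card does not state the exact formula used).
from collections import Counter

def class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in sorted(counts)}

print(class_weights([0, 0, 0, 1]))  # minority class 1 gets the larger weight
```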
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
[More Information Needed]
### Results
[More Information Needed]
#### Summary
### Model Update Log