jrippert committed
Commit 6844110
Parent(s): e3004fe

Update README.md

Files changed (1): README.md (+155 -195)

---
license: mit
---

# chat compAnIon
<br />
<div align="center">
  <a href="https://github.com/chatcompAnIon/chatcompAnIon">
    <img src="images/Chat Companion Logo.png" alt="Logo" width="400" height="400">
  </a>

  <h3 align="center">Welcome to chat compAnIon's repo!</h3>

  <p align="center">
    <br />
    <a href="https://cleemazzulla.github.io/chatcompAnIon/"><strong>Visit our Webpage »</strong></a>
    <br />
    <br />
  </p>
</div>

<!-- ADD IN LATER TABLE OF CONTENTS -->

<!-- GETTING STARTED -->

## Model Overview

**compAnIonv1** is a transformer-based large language model (LLM) trained to classify child grooming in gaming chat room environments. It is a lightweight model, with only 110,479,516 total parameters, designed to deliver classification decisions within milliseconds inside chat room dialogues.

Predicting child grooming is a highly complex NLP task, further complicated by severe class imbalance: publicly available grooming chat data is scarce, while non-grooming chat data is plentiful. Our dataset therefore contains only ~3% positive examples. Through a combination of upsampling and downsampling, we achieved a final training mix of 25% positive grooming instances. To address the remaining imbalance, the model was trained with the [Binary Focal Crossentropy](https://arxiv.org/abs/1708.02002v2) loss function using a customized gamma, which down-weights easy, well-classified examples so that training focuses on the harder-to-predict minority class.
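
As a rough illustration of that loss choice (a sketch only, not our exact training configuration; gamma=2.0 below is the focal-loss paper's default rather than our tuned value), Keras exposes this loss directly:

```python
import tensorflow as tf

# Focal loss scales standard cross-entropy by (1 - p_t)^gamma, shrinking the
# contribution of easy, well-classified examples so that the gradient signal
# concentrates on the hard (minority, grooming) class.
loss_fn = tf.keras.losses.BinaryFocalCrossentropy(gamma=2.0, from_logits=False)

# Typical usage when compiling a Keras model:
# model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
```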

The model was originally designed to train on the chat text alone, but it also accommodates new features engineered from linguistic research on child grooming. Because child grooming (especially in digital chats) often follows a [lifecycle of stages](https://safechild.org/understanding-grooming/), such as building trust and isolating the child, we were able to capture and extract these signals through various techniques, including homegrown bag-of-words-style models. The chat text, concatenated with the newly extracted grooming-stage features, was fed into the bert-base-uncased encoder to build embedding representations; the 768-dimensional pooler output of the BERT model was extracted for each conversation and passed through dense neural network layers to produce the final binary classification.

However, after exhaustive ablation studies and architecture experiments, we found that placing 1D convolutional layers on top of the text embeddings was a far more effective and automated way to extract features. As such, **compAnIonv1** relies solely on convolutional filters as feature extractors before the dense neural network layers; the sketch below shows the overall shape of this design.
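
This sketch is ours, not the shipped implementation: the filter count, kernel size, and dense-layer width are illustrative placeholders, and only the overall flow (BERT token embeddings, then 1D convolutions, then a dense head) reflects the architecture described above:

```python
import tensorflow as tf
from transformers import TFBertModel

MAX_LEN = 400  # the model's token window size

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Token-level embeddings from BERT: (batch, MAX_LEN, 768)
bert = TFBertModel.from_pretrained("bert-base-uncased")
embeddings = bert(input_ids, attention_mask=attention_mask).last_hidden_state

# 1D convolutional filters act as automated n-gram feature extractors.
x = tf.keras.layers.Conv1D(filters=128, kernel_size=3, activation="relu")(embeddings)
x = tf.keras.layers.GlobalMaxPooling1D()(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
output = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)
```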

### Technical Specs & Hardware

| **Training Specs** | **compAnIonv1** |
| :--- | :---: |
| Instance Type | AWS EC2 g5.4xlarge |
| GPUs | 1 |
| GPU Memory (GiB) | 24 |
| vCPUs | 16 |
| Memory (GiB) | 64 |
| Instance Storage (GB) | 1 x 600 NVMe SSD |
| Network Bandwidth (Gbps) | 25 |
| EBS Bandwidth (Gbps) | 8 |

### Model Data

Our model was trained on grooming conversations sourced from Perverted Justice chat logs and on non-grooming chat data from several sources, including IRC logs, Omegle, and the Chit Chats dataset. See the detailed table below:

<table>
  <tr>
    <th scope="col">Dataset</th>
    <th scope="col">Sources</th>
    <th scope="col"># Grooming conversations</th>
    <th scope="col"># Non-grooming conversations</th>
    <th scope="col"># Total conversations</th>
  </tr>
  <tr>
    <th scope="row">PAN12 Train</th>
    <td>Perverted Justice (True positives), IRC logs (True negatives), Omegle (False positives)</td>
    <td style="text-align: center;">2,015</td>
    <td style="text-align: center;">65,992</td>
    <td style="text-align: center;">68,007</td>
  </tr>
  <tr>
    <th scope="row">PAN12 Test</th>
    <td>Perverted Justice (True positives), IRC logs (True negatives), Omegle (False positives)</td>
    <td style="text-align: center;">3,723</td>
    <td style="text-align: center;">153,262</td>
    <td style="text-align: center;">156,985</td>
  </tr>
  <tr>
    <th scope="row">PJZC</th>
    <td>Perverted Justice (True positives)</td>
    <td style="text-align: center;">1,104</td>
    <td style="text-align: center;">0</td>
    <td style="text-align: center;">1,104</td>
  </tr>
</table>

<dl>
  <dt><strong>PAN12:</strong></dt>
  <dd>Put together as part of a 2012 competition to analyze sexual predators and identify high-risk text.</dd>
  <dt><strong>PJZC:</strong></dt>
  <dd>Milon-Flores and Cordeiro put together PJZC using the same method as PAN12, but with newer data. Because PAN12 Train was already imbalanced, we used only PJZC's grooming conversations for training.</dd>
</dl>

**NOTE:** There is no overlap between PAN12 and PJZC; the PJZC conversations from Perverted Justice date from 2013-2014.

See our [Datasets](https://github.com/chatcompAnIon/chatcompAnIon/tree/main/Datasets) folder for our pre-processed data.

## Getting Started

To help combat what has been deemed *AN INDUSTRY WITHOUT AN ANSWER*, chat compAnIon is making its first model, **compAnIonv1**, publicly available. To facilitate reproducibility, we have made the model available via Hugging Face: [chatcompanion/compAnIon-v1.0](https://huggingface.co/chatcompanion/compAnIon-v1.0)

### Prerequisites

To run compAnIon-v1.0, the following installs are required (the `!` prefix assumes a notebook environment):

```python
!pip install -q transformers==4.17
!git lfs install
```

### Installation & Example Run

Below is an example of how you can clone our repo to access our trained model and quickly run predictions from a notebook environment:

1. Clone the repo
```python
!git clone https://huggingface.co/chatcompanion/compAnIonv1
```
2. Change directories to access the trained weights file
```python
%cd compAnIonv1
```
3. Import the **compAnIonv1** model
```python
from compAnIonv1 import *
```
4. Run inference
```python
# Two example conversations: one benign, one showing grooming red flags
# (secrecy, isolating the child from their parents).
texts = ["School is so boring, I want to be a race car driver when I grow up!",
         "I can pick you up from school tomorrow, but don't tell your parents, ok?"]

run_inference_model(texts)
```

## Ethics and Safety

* The team did not conduct any new labeling of our dataset, to avoid imputing our own biases about what constitutes child grooming. All of our positive-class grooming instances come from chat logs used as evidence in successful court convictions of sexual predators.
* This model is intended as a tool to help parents detect and mitigate digital child grooming. We acknowledge the real impact of misclassification, such as false positives potentially damaging parent-child relationships and unintended consequences for the falsely accused.

## Intended Usage

* The model's intended use case is predicting child grooming in chat room conversations. It is meant to be a supportive tool for parents and should not be treated as a source of truth.
* There may be multiple other use cases, however, particularly in the Trust & Safety space. For instance, companies that operate chat environments may benefit from pairing such a model with sentiment analysis to monitor their chat rooms, and we believe it lends itself well to niche detection use cases such as elderly scam prevention and cyberbullying.
* This model paves the way for incorporating child grooming detection and linguistic analysis into AI models. To truly propel this field forward, however, we recognize the need for continued research, particularly on feature extraction from text as it pertains to child grooming.

## Limitations

* **compAnIonv1** is trained primarily on English text; it will only generalize well to other languages with additional training.
* The model predicts over a window of 400 tokens. Conversations vary in length, so reliability may degrade on very long conversations; see the windowing sketch after this list.
* Language is ever-changing, especially among children. The model may perform poorly if the grooming stages, or how they surface in linguistic syntax, shift over time.
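
As a minimal sketch of staying within that window (not the packaged preprocessing; `truncate_to_window` is a hypothetical helper, and we assume the standard bert-base-uncased tokenizer), one option is to keep only a conversation's most recent 400 tokens:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def truncate_to_window(conversation: str, max_tokens: int = 400) -> str:
    """Keep only the most recent max_tokens tokens of a conversation."""
    ids = tokenizer.encode(conversation, add_special_tokens=False)
    if len(ids) <= max_tokens:
        return conversation
    # Keep the tail of the conversation: the most recent messages carry
    # the context the model needs for its decision.
    return tokenizer.decode(ids[-max_tokens:])
```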

<!-- CONTACT -->
## Contact

* [Courtney Mazzulla](https://www.linkedin.com/in/courtney-l-mazzulla/) - cleemazzulla@berkeley.edu
* [Julian Rippert](https://www.linkedin.com/in/julianrippert/) - jrippert@berkeley.edu
* [Sunny Shin](https://www.linkedin.com/in/sunnyshin1/) - sunnyshin@berkeley.edu
* [Raymond Tang](https://www.linkedin.com/in/raymond-tang-0807aa1/) - raymond.tang@berkeley.edu
* [Leon Gutierrez](https://www.linkedin.com/in/leongutierrez29/) - leonrafael29@berkeley.edu
* [Karsyn Lee](https://www.linkedin.com/in/karsynlee/) - karsyn@berkeley.edu

[Project Website](https://cleemazzulla.github.io/chatcompAnIon/)

<!-- ACKNOWLEDGMENTS -->
## Acknowledgments

This project was developed as part of UC Berkeley's Master of Information and Data Science Capstone. We thank our Capstone advisors, Joyce Shen and Korin Reid, for their extensive guidance and continued support. We also invite you to explore our cohort's projects: [MIDS Capstone Projects: Spring 2024](https://www.ischool.berkeley.edu/programs/mids/capstone/2024a-spring)