---
license: cc-by-4.0
datasets:
- FredZhang7/toxi-text-3M
pipeline_tag: text-classification
language:
- ar
- es
- pa
- th
- et
- fr
- fi
- hu
- lt
- ur
- so
- pl
- el
- mr
- sk
- gu
- he
- af
- te
- ro
- lv
- sv
- ne
- kn
- it
- mk
- cs
- en
- de
- da
- ta
- bn
- pt
- sq
- tl
- uk
- bg
- ca
- sw
- hi
- zh
- ja
- hr
- ru
- vi
- id
- sl
- cy
- ko
- nl
- ml
- tr
- fa
- 'no'
- multilingual
tags:
- nlp
- moderation
---

[Link to the distilbert spam defender](https://huggingface.co/FredZhang7/distilbert-spam-defender)

Find the v1 (TensorFlow) model in SavedModel format on [this page](https://github.com/FredZhang7/tfjs-node-tiny/releases/tag/text-classification). The v1 model is licensed under Apache 2.0.


<br>

|          |    v3    |    v1    |
|----------|----------|----------|
| Base Model   | bert-base-multilingual-cased   |  nlpaueb/legal-bert-small-uncased   |
| Base Tokenizer   |  bert-base-multilingual-cased   |  bert-base-multilingual-cased  |
| Framework  | PyTorch   |  TensorFlow   |
| Dataset Size  |  3.0M |  2.68M   |
| Train Split | 80% English<br>20% English + 100% Multilingual |  None  |
| English Train Accuracy  |  99.5% |  N/A (≈97.5%)  |
| Other Train Accuracy  | 98.6%  |  96.6%  |
| Final Val Accuracy  |  96.8%  |  94.6%  |
| Languages |  55  |  N/A (≈35)  |
| Hyperparameters  | maxlen=208<br>padding='max_length'<br>batch_size=112<br>optimizer=AdamW<br>learning_rate=1e-5<br>loss=BCEWithLogitsLoss()  |  maxlen=192<br>padding='max_length'<br>batch_size=16<br>optimizer=Adam<br>learning_rate=1e-5<br>loss="binary_crossentropy"  |
| Training Stopped |  7/20/2023  |  9/05/2022  |
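
For reference, the v3 hyperparameters listed in the table could be wired into a fine-tuning loop roughly like the sketch below. It is a minimal sketch, not the released training script: the dataset format (a list of `(text, label)` pairs), the one-hot label encoding, and the toy call at the end are assumptions.

```python
# Minimal sketch of a fine-tuning loop using the v3 hyperparameters from the table.
# The dataset format and label encoding are assumptions; this is not the exact
# script used to train the released model.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2
).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # optimizer=AdamW, learning_rate=1e-5
criterion = torch.nn.BCEWithLogitsLoss()                    # loss=BCEWithLogitsLoss()

def collate(batch):
    # maxlen=208, padding='max_length'
    texts, labels = zip(*batch)
    enc = tokenizer(list(texts), max_length=208, padding="max_length",
                    truncation=True, return_tensors="pt")
    return enc, torch.tensor(labels, dtype=torch.long)

def train_one_epoch(pairs):
    loader = DataLoader(pairs, batch_size=112, shuffle=True, collate_fn=collate)
    model.train()
    for enc, labels in loader:
        enc = {k: v.to(device) for k, v in enc.items()}
        logits = model(**enc).logits
        # BCEWithLogitsLoss expects float targets shaped like the logits,
        # so the integer labels are one-hot encoded here.
        targets = torch.nn.functional.one_hot(labels, num_classes=2).float().to(device)
        loss = criterion(logits, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Toy call with two placeholder examples, just to show the expected input format.
train_one_epoch([("have a great day", 0), ("you are trash", 1)])
```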

<br>

I manually annotated additional examples on top of Toxi Text 3M and added them to the training set.
Training on Toxi Text 3M alone produces a biased model that classifies short text with lower precision.
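
A sketch of that kind of data mix is below: Toxi Text 3M extended with a few hand-labeled rows. The split name and the `text`/`is_toxic` column names are assumptions, so check the dataset card for the actual schema.

```python
# Sketch: extending Toxi Text 3M with manually annotated rows before training.
# The split name and the "text"/"is_toxic" column names are assumptions; verify
# them against https://huggingface.co/datasets/FredZhang7/toxi-text-3M.
from datasets import Dataset, concatenate_datasets, load_dataset

base = load_dataset("FredZhang7/toxi-text-3M", split="train")
# Keep only the columns the extra annotations also provide, so the schemas match.
base = base.remove_columns([c for c in base.column_names if c not in ("text", "is_toxic")])

# Hypothetical hand-labeled examples, e.g. short texts where the base data is weakest.
extra = Dataset.from_dict({
    "text": ["have a great day", "get lost, loser"],
    "is_toxic": [0, 1],
})
extra = extra.cast(base.features)  # align dtypes with the base dataset

combined = concatenate_datasets([base, extra]).shuffle(seed=42)
print(combined)
```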

<br>

Models tested for v2: roberta, xlm-roberta, bert-small, bert-base-cased/uncased, bert-multilingual-cased/uncased, and albert-large-v2.
Of these, I chose bert-multilingual-cased because it performed best on this particular task with the same amount of resources as the others.

<br>

## PyTorch

```python
text = "hello world!"

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/one-for-all-toxicity-v3")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/one-for-all-toxicity-v3").to(device)

encoding = tokenizer.encode_plus(
    text,
    add_special_tokens=True,
    max_length=208,
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)
print('device:', device)
input_ids = encoding["input_ids"].to(device)
attention_mask = encoding["attention_mask"].to(device)

with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits
    predicted_labels = torch.argmax(logits, dim=1)

print(predicted_labels)
```
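
The snippet above classifies a single string. For moderating several texts at once, the same tokenizer and model can be reused in batched form, as in the sketch below; reading class index 1 as "toxic" is an assumption, so confirm the mapping via `model.config.id2label`.

```python
# Batched inference, reusing the tokenizer, model, and device from the snippet above.
texts = ["hello world!", "you are all terrible people"]

batch = tokenizer(
    texts,
    add_special_tokens=True,
    max_length=208,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
).to(device)

with torch.no_grad():
    logits = model(**batch).logits
    probs = torch.softmax(logits, dim=1)
    preds = torch.argmax(logits, dim=1)

# Assumption: class index 1 means "toxic"; check model.config.id2label to confirm.
for text, pred, prob in zip(texts, preds, probs):
    print(f"{text!r} -> class {pred.item()} (score {prob[pred].item():.3f})")
```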

## Attribution
- If you distribute, remix, adapt, or build upon One-for-all Toxicity v3, please credit "AIstrova Technologies Inc." in your README.md, application description, research, or website.