FredZhang7
/

one-for-all-toxicity-v3



	---
	license: cc-by-4.0
	datasets:
	- FredZhang7/toxi-text-3M
	pipeline_tag: text-classification
	language:
	- ar
	- es
	- pa
	- th
	- et
	- fr
	- fi
	- hu
	- lt
	- ur
	- so
	- pl
	- el
	- mr
	- sk
	- gu
	- he
	- af
	- te
	- ro
	- lv
	- sv
	- ne
	- kn
	- it
	- mk
	- cs
	- en
	- de
	- da
	- ta
	- bn
	- pt
	- sq
	- tl
	- uk
	- bg
	- ca
	- sw
	- hi
	- zh
	- ja
	- hr
	- ru
	- vi
	- id
	- sl
	- cy
	- ko
	- nl
	- ml
	- tr
	- fa
	- 'no'
	- multilingual
	tags:
	- nlp
	- moderation
	---

	The v1 (TensorFlow) model is available on [this page](https://github.com/FredZhang7/tfjs-node-tiny/releases/tag/text-classification).
	The v1 model is licensed under Apache 2.0.

	<br>

	| | v3 | v1 |
	|----------|----------|----------|
	| Base Model | bert-base-multilingual-cased | nlpaueb/legal-bert-small-uncased |
	| Base Tokenizer | bert-base-multilingual-cased | bert-base-multilingual-cased |
	| Framework | PyTorch | TensorFlow |
	| Dataset Size | 3.0M | 2.68M |
	| Train Split | 80% English<br>20% English + 100% Multilingual | None |
	| English Train Accuracy | 99.5% | N/A (≈97.5%) |
	| Other Train Accuracy | 98.6% | 96.6% |
	| Final Val Accuracy | 96.8% | 94.6% |
	| Languages | 55 | N/A (≈35) |
	| Hyperparameters | maxlen=208<br>padding='max_length'<br>batch_size=112<br>optimizer=AdamW<br>learning_rate=1e-5<br>loss=BCEWithLogitsLoss() | maxlen=192<br>padding='max_length'<br>batch_size=16<br>optimizer=Adam<br>learning_rate=1e-5<br>loss="binary_crossentropy" |
	| Training Stopped | 7/20/2023 | 9/05/2022 |
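
The v3 column lists `BCEWithLogitsLoss()` as the training loss. As a reminder of what that loss computes, here is a minimal pure-Python sketch of binary cross-entropy on raw logits with mean reduction. It is an illustration only, not the training code used for this model, and the function name `bce_with_logits` is made up for this example.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from raw logits.

    Illustrative helper (hypothetical name). Uses the numerically stable form
        max(x, 0) - x * t + log(1 + exp(-|x|))
    which equals -[t * log(sigmoid(x)) + (1 - t) * log(1 - sigmoid(x))].
    """
    if len(logits) != len(targets):
        raise ValueError("logits and targets must have the same length")
    total = 0.0
    for x, t in zip(logits, targets):
        # exp(-|x|) never overflows, even for very large |x|
        total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return total / len(logits)
```

Working on logits rather than on sigmoid outputs avoids taking the logarithm of a probability that has been rounded to 0 or 1.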

	<br>

	I manually annotated additional data on top of Toxi Text 3M and added it to the training set.
	Training on Toxi Text 3M alone produces a biased model that classifies short text with lower precision.

	<br>

	Models tested for v2: roberta, xlm-roberta, bert-small, bert-base-cased/uncased, bert-multilingual-cased/uncased, and albert-large-v2.
	Of these, I chose bert-multilingual-cased because it performed best on this task when every model was given the same amount of resources.

	<br>

	## PyTorch

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	text = "hello world!"

	# Use a GPU when one is available, otherwise fall back to the CPU.
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	print("device:", device)

	tokenizer = AutoTokenizer.from_pretrained("FredZhang7/one-for-all-toxicity-v3")
	model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/one-for-all-toxicity-v3").to(device)

	# Pad or truncate to 208 tokens, the maxlen used during training.
	encoding = tokenizer.encode_plus(
	    text,
	    add_special_tokens=True,
	    max_length=208,
	    padding="max_length",
	    truncation=True,
	    return_tensors="pt",
	)
	input_ids = encoding["input_ids"].to(device)
	attention_mask = encoding["attention_mask"].to(device)

	# Inference only, so skip gradient tracking.
	with torch.no_grad():
	    outputs = model(input_ids, attention_mask=attention_mask)
	    logits = outputs.logits
	    predicted_labels = torch.argmax(logits, dim=1)

	print(predicted_labels)
	```
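
To classify many texts at once, the single-text snippet above can be wrapped in a small batching helper. The following is a minimal sketch, not part of the original model card: the function name `classify_texts` and the default `batch_size=32` are assumptions chosen for illustration, and `max_length=208` mirrors the training setting listed in the table.

```python
import torch


def classify_texts(texts, tokenizer, model, device, max_length=208, batch_size=32):
    """Return one predicted class index per input text.

    Hypothetical helper: `tokenizer` and `model` are expected to be the
    objects loaded with AutoTokenizer / AutoModelForSequenceClassification.
    """
    predictions = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # Pad or truncate every text in the batch to the same fixed length.
        encoding = tokenizer(
            batch,
            max_length=max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        input_ids = encoding["input_ids"].to(device)
        attention_mask = encoding["attention_mask"].to(device)
        # Inference only, so gradients are not needed.
        with torch.no_grad():
            logits = model(input_ids, attention_mask=attention_mask).logits
        # Same decision rule as the single-text example: argmax over classes.
        predictions.extend(torch.argmax(logits, dim=1).tolist())
    return predictions
```

With the `tokenizer`, `model`, and `device` from the snippet above, `classify_texts(["hello world!", "good morning"], tokenizer, model, device)` returns a list of two class indices. Lower `batch_size` if the batch does not fit in memory.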

	## Attribution
	- If you distribute, remix, adapt, or build upon One-for-all Toxicity v3, please credit "AIstrova Technologies Inc." in your README.md, application description, research, or website.