AgentPublic/camembert-base-toxic-fr-user-prompts

This model is a camembert-base model fine-tuned on a French translated toxic-chat dataset plus additional synthetic data. The model is trained to classify user prompts into three categories: "Toxic", "Non-Toxic", and "Sensible".

Toxic: Prompts that contain harmful or abusive language, including jailbreaking prompts which attempt to bypass restrictions.
Non-Toxic: Prompts that are safe and free of harmful content.
Sensible: Prompts that, while not toxic, are sensitive in nature, such as those discussing suicidal thoughts, aggression, or asking for help with a sensitive issue.

The evaluation results are as follows (still under evaluation, more data is needed):

	Precision	Recall	F1-Score
Non-Toxic	0.97	0.95	0.96
Sensible	0.95	0.99	0.98
Toxic	0.87	0.90	0.88

Accuracy			0.94
Macro Avg	0.93	0.95	0.94
Weighted Avg	0.94	0.94	0.94

Note: This model is still under development, and its performance and characteristics are subject to change as training is not yet complete.

AgentPublic
/

camembert-base-toxic-fr-user-prompts

Dataset used to train AgentPublic/camembert-base-toxic-fr-user-prompts

Collection including AgentPublic/camembert-base-toxic-fr-user-prompts

Albert