The model has been trained to predict whether an English sentence is formal or informal.

Base model: roberta-base

Datasets: GYAFC (Rao and Tetreault, 2018) and the online formality corpus (Pavlick and Tetreault, 2016).

Data augmentation: changing texts to upper or lower case, removing all punctuation, and adding a dot at the end of a sentence. It was applied because the model otherwise relies too heavily on punctuation and capitalization and pays too little attention to other features.
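
The augmentations listed above could be sketched roughly as follows. This is an illustration only, not the authors' training code: the function names, the sampling probability `p`, and the choice to apply at most one perturbation per text are all assumptions.

```python
import random
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def to_upper(text: str) -> str:
    return text.upper()


def to_lower(text: str) -> str:
    return text.lower()


def remove_punctuation(text: str) -> str:
    return text.translate(_PUNCT_TABLE)


def add_final_dot(text: str) -> str:
    # Append a dot unless the text already ends with sentence-final punctuation.
    stripped = text.rstrip()
    if stripped and stripped[-1] not in ".!?":
        return stripped + "."
    return stripped


# Hypothetical pool of perturbations; the real pipeline may combine them differently.
AUGMENTATIONS = [to_upper, to_lower, remove_punctuation, add_final_dot]


def augment(text: str, rng: random.Random, p: float = 0.5) -> str:
    """With probability p (an assumed value), apply one random perturbation."""
    if rng.random() >= p:
        return text
    return rng.choice(AUGMENTATIONS)(text)
```

For example, `augment("Hello, world!", random.Random(0))` returns either the original text or one perturbed variant, so the classifier sees the same label paired with different casing and punctuation.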

Loss: binary classification (on GYAFC), in-batch ranking (on the Pavlick and Tetreault data).
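
The card does not spell out the exact form of either loss, so the following is only one plausible reading, written in plain Python for clarity: binary cross-entropy on a single logit for GYAFC, and a pairwise logistic loss over all pairs within a batch for the graded data. The pairwise formulation, the function names, and the assumption that the ranking data carries graded formality scores are not confirmed by the source. Real training would use a tensor library rather than Python lists.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def binary_cross_entropy(logits, labels):
    """Mean BCE for labels in {0, 1}, computed directly from logits."""
    total = 0.0
    for z, y in zip(logits, labels):
        # -log(sigmoid(z)) = softplus(-z); -log(1 - sigmoid(z)) = softplus(z)
        total += softplus(-z) if y == 1 else softplus(z)
    return total / len(logits)


def in_batch_ranking_loss(scores, targets):
    """Assumed pairwise logistic loss over all ordered pairs in the batch.

    For every pair (i, j) whose gold score satisfies targets[i] > targets[j],
    the model is penalized unless scores[i] exceeds scores[j].
    """
    losses = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if targets[i] > targets[j]:
                losses.append(softplus(-(scores[i] - scores[j])))
    # A batch with no comparable pairs contributes nothing.
    return sum(losses) / len(losses) if losses else 0.0
```

Under this reading, a ranking loss only constrains the relative order of predicted scores within a batch, which matches evaluating the graded data with a rank correlation rather than accuracy.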

Performance metrics on the test data:

| dataset | ROC AUC | precision | recall | fscore | accuracy | Spearman |
|---|---|---|---|---|---|---|
| GYAFC | 0.9779 | 0.90 | 0.91 | 0.90 | 0.9087 | 0.8233 |
| GYAFC normalized (lowercase + remove punct.) | 0.9234 | 0.85 | 0.81 | 0.82 | 0.8218 | 0.7294 |

| P&T subset | Spearman R |
|---|---|
| news | 0.4003 |
| answers | 0.7500 |
| blog | 0.7334 |
| email | 0.7606 |
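
For readers unfamiliar with the Spearman column, the sketch below shows how a Spearman rank correlation between model scores and gold formality scores can be computed: the Pearson correlation of the two rank vectors, with ties receiving averaged ranks. It is a dependency-free illustration with assumed function names, not the evaluation script behind the numbers above; in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    # Undefined (division by zero) if either input is constant.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted, gold):
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(predicted), average_ranks(gold))
```

Because only ranks matter, the metric is unaffected by any monotonic rescaling of the model's output scores.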

Citation

If you use the model in your research, please cite the paper in which it was introduced:

@InProceedings{10.1007/978-3-031-35320-8_4,
  author="Babakov, Nikolay
  and Dale, David
  and Gusev, Ilya
  and Krotova, Irina
  and Panchenko, Alexander",
  editor="M{\'e}tais, Elisabeth
  and Meziane, Farid
  and Sugumaran, Vijayan
  and Manning, Warren
  and Reiff-Marganiec, Stephan",
  title="Don't Lose the Message While Paraphrasing: A Study on Content Preserving Style Transfer",
  booktitle="Natural Language Processing and Information Systems",
  year="2023",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="47--61",
  isbn="978-3-031-35320-8"
}

Licensing Information

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

CC BY-NC-SA 4.0
