	---
	language:
	- en
	- fr
	- it
	- pt
	tags:
	- formality
	licenses:
	- cc-by-nc-sa
	license: cc-by-nc-sa-4.0
	---


	## Model Overview

	This is the model presented in the paper "Detecting Text Formality: A Study of Text Classification Approaches".

	The original model is [mDistilBERT (base)](https://huggingface.co/distilbert-base-multilingual-cased). It was then fine-tuned on [X-FORMAL](https://arxiv.org/abs/2104.04108), a multilingual corpus for formality classification that covers 4 languages: English (from [GYAFC](https://arxiv.org/abs/1803.06535)), French, Italian, and Brazilian Portuguese.
	In our experiments, the model showed the best results among Transformer-based models on the cross-lingual formality classification knowledge transfer task. More details, code, and data can be found [here](https://github.com/s-nlp/formality).

	## Evaluation Results

	Here, we report results for the best model from each category that took part in the comparison, to give a sense of the range of values. We report accuracy for two setups: the multilingual model fine-tuned on each language separately, and the multilingual model fine-tuned on all languages together.
	For the results of the cross-lingual experiments, please refer to the paper.

	| Model             | En   | It   | Pt   | Fr   | All  |
	|-------------------|------|------|------|------|------|
	| bag-of-words      | 79.1 | 71.3 | 70.6 | 72.5 | ---  |
	| CharBiLSTM        | 87.0 | 79.1 | 75.9 | 81.3 | 82.7 |
	| mDistilBERT-cased | 86.6 | 76.8 | 75.9 | 79.1 | 79.4 |
	| mDeBERTa-base     | 87.3 | 76.6 | 75.8 | 78.9 | 79.9 |

	## How to use
	```python
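	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	# Hub ID; the `s-nlp/` organization prefix is an assumption based on the project's GitHub namespace
	model_name = 's-nlp/mdistilbert-base-formality-ranker'
	# Load the tokenizer and the fine-tuned sequence classification model
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)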
	```
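
Once the model and tokenizer are loaded, scoring a sentence could look roughly like the sketch below. It is an illustration rather than the authors' inference code: the helper names `softmax`, `label_probabilities`, and `classify` are hypothetical, and the label names are read from the checkpoint's `config.id2label`, since this card does not state which index means "formal".

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probabilities(logits, id2label):
    """Map one row of raw logits to a {label: probability} dictionary."""
    probs = softmax(logits)
    return {id2label[i]: p for i, p in enumerate(probs)}


def classify(texts, model, tokenizer):
    """Return one {label: probability} dictionary per input text.

    `model` and `tokenizer` are assumed to be the objects loaded with
    AutoModelForSequenceClassification / AutoTokenizer.
    """
    # Imported lazily so the helpers above work without PyTorch installed.
    import torch

    model.eval()
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Label names come from the checkpoint config; they are not assumed here.
    id2label = model.config.id2label
    return [label_probabilities(row, id2label) for row in logits.tolist()]
```

With `model` and `tokenizer` loaded as above, `classify(["Hey, what's up?", "I would be grateful for your reply."], model, tokenizer)` returns one dictionary per sentence; check the returned label names before interpreting the scores.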

	## Citation
	```
	@inproceedings{dementieva-etal-2023-detecting,
	title = "Detecting Text Formality: A Study of Text Classification Approaches",
	author = "Dementieva, Daryna and
	Babakov, Nikolay and
	Panchenko, Alexander",
	editor = "Mitkov, Ruslan and
	Angelova, Galia",
	booktitle = "Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing",
	month = sep,
	year = "2023",
	address = "Varna, Bulgaria",
	publisher = "INCOMA Ltd., Shoumen, Bulgaria",
	url = "https://aclanthology.org/2023.ranlp-1.31",
	pages = "274--284",
	abstract = "Formality is one of the important characteristics of text documents. The automatic detection of the formality level of a text is potentially beneficial for various natural language processing tasks. Before, two large-scale datasets were introduced for multiple languages featuring formality annotation{---}GYAFC and X-FORMAL. However, they were primarily used for the training of style transfer models. At the same time, the detection of text formality on its own may also be a useful application. This work proposes the first to our knowledge systematic study of formality detection methods based on statistical, neural-based, and Transformer-based machine learning methods and delivers the best-performing models for public usage. We conducted three types of experiments {--} monolingual, multilingual, and cross-lingual. The study shows the overcome of Char BiLSTM model over Transformer-based ones for the monolingual and multilingual formality classification task, while Transformer-based classifiers are more stable to cross-lingual knowledge transfer.",
	}
	```

	## Licensing Information

	[Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License][cc-by-nc-sa].

	[![CC BY-NC-SA 4.0][cc-by-nc-sa-image]][cc-by-nc-sa]

	[cc-by-nc-sa]: http://creativecommons.org/licenses/by-nc-sa/4.0/
	[cc-by-nc-sa-image]: https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png