|
--- |
|
language: "ru" |
|
tags: |
|
- normalization |
|
- denoising autoencoder |
|
- russian |
|
widget: |
|
- text: "меня тобой не понимать" |
|
license: mit |
|
--- |
|
This is a small Russian denoising autoencoder that can be used to restore corrupted sentences.
|
|
|
This model was produced by fine-tuning the [rut5-small](https://huggingface.co/cointegrated/rut5-small) model on the task of reconstructing a sentence: |
|
* restoring word order (after slightly shuffling the words)
|
* restoring dropped words and punctuation marks (after dropping some of them randomly) |
|
* restoring the inflection of words (after randomly changing their inflection with the [natasha](https://github.com/natasha/natasha) and [pymorphy2](https://github.com/kmike/pymorphy2) packages), as sketched below
|
|
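The exact noising script is not published with this card, so the snippet below is only a toy sketch of the kind of corruption described above (local shuffling and random token dropping); the function name, the probabilities, and the omission of the inflection step are illustrative assumptions, not the actual training code:

```python
import random

def corrupt(sentence, drop_prob=0.1, max_shift=1.0, seed=None):
    """Toy noising: slightly shuffle word order and randomly drop tokens.

    Illustrative only; the real fine-tuning corruption (including the
    inflection changes made with natasha/pymorphy2) may differ.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    # "slight" shuffle: sort positions after adding a small random jitter
    keys = [i + rng.uniform(-max_shift, max_shift) for i in range(len(tokens))]
    shuffled = [tok for _, tok in sorted(zip(keys, tokens))]
    # randomly drop some words / punctuation tokens
    kept = [tok for tok in shuffled if rng.random() > drop_prob]
    return ' '.join(kept) if kept else sentence

# prints a slightly shuffled and/or truncated copy of the sentence
print(corrupt('Я тебя не понимаю.', seed=42))
```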
|
The fine-tuning was performed on a [Leipzig web corpus](https://wortschatz.uni-leipzig.de/en/download/Russian) of Russian sentences. |
|
|
|
The model can be applied as follows: |
|
```python
# !pip install transformers sentencepiece

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("cointegrated/rut5-small-normalizer")
model = T5ForConditionalGeneration.from_pretrained("cointegrated/rut5-small-normalizer")

text = 'меня тобой не понимать'
inputs = tokenizer(text, return_tensors='pt')

# sample several normalization hypotheses for the corrupted input
with torch.no_grad():
    hypotheses = model.generate(
        **inputs,
        do_sample=True, top_p=0.95,
        num_return_sequences=5,
        repetition_penalty=2.5,
        max_length=32,
    )

for h in hypotheses:
    print(tokenizer.decode(h, skip_special_tokens=True))
```
|
A possible output is: |
|
```
# Мне тебя не понимать.
# Если бы ты понимаешь меня?
# Я с тобой не понимаю.
# Я тебя не понимаю.
# Я не понимаю о чем ты.
```
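
Since sampling is enabled (`do_sample=True`), the hypotheses will differ from run to run. For a single deterministic output, one can instead call `model.generate` with `do_sample=False` and, for example, `num_beams=5` (standard Hugging Face generation arguments).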