metadata
language: es
license: mit
tags:
  - generated_from_trainer
base_model: flax-community/spanish-t5-small
model-index:
  - name: poem-gen-spanish-t5-small
    results: []

poem-gen-spanish-t5-small

This model is a fine-tuned version of flax-community/spanish-t5-small on the Spanish Poetry Dataset.

The model was created during the First Spanish Hackathon organized by Somos NLP.

The participating team was composed of:

It achieves the following results on the evaluation set:

  • Loss: 2.8707
  • Perplexity: 17.65
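The two figures are consistent with each other if perplexity is taken as the exponential of the evaluation loss, which is the usual convention for a cross-entropy loss in nats. That convention is an assumption here, since the card does not state how perplexity was computed. A minimal check:

```python
import math

# Evaluation loss reported above (assumed to be mean cross-entropy in nats).
eval_loss = 2.8707

# Perplexity under the usual convention: exp(cross-entropy).
perplexity = math.exp(eval_loss)

print(f"{perplexity:.2f}")
```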

Model description

The model was trained to generate Spanish poems conditioned on parameters such as style, sentiment, words to include, and a starting phrase.

Example:

poema:
  estilo: Pablo Neruda &&
  sentimiento: positivo &&
  palabras: cielo, luna, mar &&
  texto: Todos fueron a verle pasar
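The prompt format above can also be assembled programmatically. The sketch below uses a hypothetical helper, `build_prompt`, which is not part of the model or of Transformers. It assumes the fields are joined on a single line with ` && ` and that multiple words are comma-separated, as the example suggests.

```python
def build_prompt(style, sentiment, words, start_text):
    """Assemble a prompt in the 'poema: ...' format shown above.

    Hypothetical helper for illustration; the field order and the ' && '
    separator are taken from the example prompt.
    """
    fields = [
        f"estilo: {style}",
        f"sentimiento: {sentiment}",
        f"palabras: {', '.join(words)}",
        f"texto: {start_text}",
    ]
    return "poema: " + " && ".join(fields)


prompt = build_prompt(
    "Pablo Neruda",
    "positivo",
    ["cielo", "luna", "mar"],
    "Todos fueron a verle pasar",
)
print(prompt)
```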

How to use

You can load the model with the Transformers library and generate a continuation for a prompt:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned checkpoint and its tokenizer from the Hub.
model_name = 'hackathon-pln-es/poem-gen-spanish-t5-small'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Build the prompt in the format described above.
author, sentiment, word, start_text = 'Pablo Neruda', 'positivo', 'cielo', 'Todos fueron a la plaza'
input_text = f"""poema: estilo: {author} && sentimiento: {sentiment} && palabras: {word} && texto: {start_text} """
inputs = tokenizer(input_text, return_tensors="pt")

# Sample one continuation using top-k and nucleus (top-p) sampling.
outputs = model.generate(inputs["input_ids"],
                         do_sample=True,
                         max_length=30,
                         repetition_penalty=20.0,
                         top_k=50,
                         top_p=0.92)

# Decode the generated token IDs back into text.
detok_outputs = [tokenizer.decode(x, skip_special_tokens=True) for x in outputs]
res = detok_outputs[0]

Training and evaluation data

The original dataset has the columns author, content and title. For each poem we generate new examples:

  • content: line_i , generated: line_i+1
  • content: concatenate(line_i, line_i+1) , generated: line_i+2
  • content: concatenate(line_i, line_i+1, line_i+2) , generated: line_i+3

The resulting dataset has the columns author, content, title and generated.
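The expansion described above can be sketched as follows. This is an illustration and not the team's actual preprocessing code: the function name `expand_poem`, the cap of three context lines (inferred from the three patterns listed), and the use of a single space to concatenate lines are all assumptions.

```python
def expand_poem(lines, max_context=3, sep=" "):
    """Turn a poem's lines into (content, generated) training pairs.

    For each starting line i, the context grows from one line up to
    `max_context` lines, and the target is the line that follows it.
    """
    examples = []
    for i in range(len(lines)):
        for k in range(1, max_context + 1):
            target = i + k
            if target >= len(lines):
                break  # no following line left to predict
            examples.append({
                "content": sep.join(lines[i:target]),
                "generated": lines[target],
            })
    return examples


pairs = expand_poem(["a", "b", "c", "d"])
for pair in pairs:
    print(pair)
```

In the real dataset each pair would also carry the poem's author and title, which are omitted here for brevity.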

For each example we compute the sentiment of the generated column and extract its nouns. For sentiment we used the model mrm8488/electricidad-small-finetuned-restaurant-sentiment-analysis, and for noun extraction we used spaCy.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 6
  • eval_batch_size: 6
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 6
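To illustrate what a linear scheduler does to the learning rate listed above, here is a small pure-Python sketch. It assumes no warmup (the card lists none) and takes the total number of training steps as a parameter. The 246,000 used in the example is only a rough estimate extrapolated from the results table (step 240000 at epoch 5.85), not a figure stated by the authors.

```python
def linear_lr(step, total_steps, base_lr=2e-05, warmup_steps=0):
    """Learning rate at `step` under linear decay to zero.

    Illustrative re-implementation, not the library's code. With
    warmup_steps=0 (an assumption) the rate starts at base_lr.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining


estimated_total = 246_000  # rough estimate, see the lead-in
for step in (0, 120_000, 240_000):
    print(step, linear_lr(step, estimated_total))
```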

Training results

Training Loss   Epoch   Step     Validation Loss
2.7082          0.73    30000    2.8878
2.6251          1.46    60000    2.8940
2.5796          2.19    90000    2.8853
2.5556          2.93    120000   2.8749
2.527           3.66    150000   2.8850
2.5024          4.39    180000   2.8760
2.4887          5.12    210000   2.8749
2.4808          5.85    240000   2.8707

Framework versions

  • Transformers 4.17.0
  • PyTorch 1.10.0+cu111
  • Datasets 2.0.0
  • Tokenizers 0.11.6