---
license: mit
language: es
tags:
- generated_from_trainer
model-index:
- name: poem-gen-spanish-t5-small
  results: []
---

# poem-gen-spanish-t5-small

This model is a fine-tuned version of [flax-community/spanish-t5-small](https://huggingface.co/flax-community/spanish-t5-small) on the [Spanish Poetry Dataset](https://www.kaggle.com/andreamorgar/spanish-poetry-dataset/version/1).

The model was created during the [First Spanish Hackathon](https://somosnlp.org/hackathon) organized by [Somos NLP](https://somosnlp.org/).

The team that participated was composed of:

- 🇨🇺 [Alberto Carmona Barthelemy](https://huggingface.co/milyiyo)
- 🇪🇸 [Andrea Morales Garzón](https://huggingface.co/andreamorgar)
- 🇨🇴 [Jorge Henao](https://huggingface.co/jorge-henao)
- 🇮🇳 [Drishti Sharma](https://huggingface.co/DrishtiSharma)


It achieves the following results on the evaluation set:
- Loss: 2.8586
- Perplexity: 17.43
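
The reported perplexity appears to be the exponentiated validation loss: exp(2.8586) ≈ 17.4.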

## Model description

The model was trained to generate Spanish poems conditioned on parameters such as style, sentiment, words to include, and a starting phrase.

Example:

```
poema:
  estilo: Pablo Neruda &&
  sentimiento: positivo &&
  palabras: cielo, luna, mar &&
  texto: Todos fueron a verle pasar
```

### How to use

You can use this model directly for text-to-text generation:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = 'hackathon-pln-es/poem-gen-spanish-t5-small'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Build the prompt with the fields the model was trained on:
# style (author), sentiment, words to include and the starting phrase.
author, sentiment, word, start_text = 'Pablo Neruda', 'positivo', 'cielo', 'Todos fueron a la plaza'
input_text = f"""poema: estilo: {author} && sentimiento: {sentiment} && palabras: {word} && texto: {start_text} """
inputs = tokenizer(input_text, return_tensors="pt")

# Sample a continuation; adjust max_length and the sampling parameters as needed.
outputs = model.generate(inputs["input_ids"],
                         do_sample=True,
                         max_length=30,
                         repetition_penalty=20.0,
                         top_k=50,
                         top_p=0.92)
detok_outputs = [tokenizer.decode(x, skip_special_tokens=True) for x in outputs]
res = detok_outputs[0]
```
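
The same prompt format also works with the `text2text-generation` pipeline (a minimal sketch; the sampling parameters here are illustrative):

```python
from transformers import pipeline

generator = pipeline('text2text-generation',
                     model='hackathon-pln-es/poem-gen-spanish-t5-small')

prompt = ('poema: estilo: Pablo Neruda && sentimiento: positivo && '
          'palabras: cielo && texto: Todos fueron a la plaza ')
print(generator(prompt, do_sample=True, max_length=30, top_k=50, top_p=0.92)[0]['generated_text'])
```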

## Training and evaluation data

The original dataset has the columns `author`, `content` and `title`.
For each poem we generate new training examples from consecutive lines (see the sketch after this list):
- content: *line_i* , generated: *line_i+1*
- content: *concatenate(line_i, line_i+1)* , generated: *line_i+2*
- content: *concatenate(line_i, line_i+1, line_i+2)* , generated: *line_i+3*

The resulting dataset has the columns `author`, `content`, `title` and `generated`.
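
A minimal sketch of this expansion (the function below and joining lines with newlines are assumptions, not the authors' exact script):

```python
def expand_poem(author, title, content, max_window=3):
    """Build (content, generated) pairs from consecutive poem lines."""
    lines = [l.strip() for l in content.split('\n') if l.strip()]
    examples = []
    for i in range(len(lines)):
        for window in range(1, max_window + 1):
            if i + window >= len(lines):
                break
            examples.append({
                'author': author,
                'title': title,
                'content': '\n'.join(lines[i:i + window]),  # line_i .. line_{i+window-1}
                'generated': lines[i + window],              # the next line
            })
    return examples
```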

For each example we compute the sentiment of the `generated` column and extract its nouns. For sentiment we used the model `mrm8488/electricidad-small-finetuned-restaurant-sentiment-analysis`, and for noun extraction we used spaCy.
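
A rough sketch of how these annotations could be computed (the spaCy pipeline `es_core_news_sm` is an assumption; only the sentiment model name comes from the description above):

```python
import spacy
from transformers import pipeline

sentiment = pipeline('sentiment-analysis',
                     model='mrm8488/electricidad-small-finetuned-restaurant-sentiment-analysis')
nlp = spacy.load('es_core_news_sm')  # assumed Spanish spaCy model with POS tags

def annotate(generated_line):
    """Return the sentiment label and the nouns found in a generated line."""
    label = sentiment(generated_line)[0]['label']
    nouns = [tok.text for tok in nlp(generated_line) if tok.pos_ == 'NOUN']
    return label, nouns
```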
 

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (see the sketch after the list):
- learning_rate: 2e-05
- train_batch_size: 6
- eval_batch_size: 6
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 6
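
A minimal sketch of how these hyperparameters map onto `Seq2SeqTrainingArguments` (the `output_dir` is a placeholder; this is not the authors' training script):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir='poem-gen-spanish-t5-small',  # placeholder
    learning_rate=2e-5,
    per_device_train_batch_size=6,
    per_device_eval_batch_size=6,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type='linear',
    num_train_epochs=6,
)
```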

### Training results

| Training Loss | Epoch | Step   | Validation Loss |
|:-------------:|:-----:|:------:|:---------------:|
| 3.1354        | 0.73  | 30000  | 3.0147          |
| 2.9761        | 1.46  | 60000  | 2.9498          |
| 2.897         | 2.19  | 90000  | 2.9019          |
| 2.8292        | 2.93  | 120000 | 2.8792          |
| 2.7774        | 3.66  | 150000 | 2.8738          |
| 2.741         | 4.39  | 180000 | 2.8634          |
| 2.7128        | 5.12  | 210000 | 2.8666          |
| 2.7108        | 5.85  | 240000 | 2.8595          |


### Framework versions

- Transformers 4.17.0
- Pytorch 1.10.0+cu111
- Datasets 2.0.0
- Tokenizers 0.11.6