---
datasets:
- rebel-short
metrics:
- rouge
model-index:
- name: flan-t5-base
  results:
  - task:
      name: Sequence-to-sequence Language Modeling
      type: text2text-generation
    dataset:
      name: rebel-short
      type: rebel-short
      config: default
      split: test
      args: default
    metrics:
    - name: Rouge1
      type: rouge
      value: 51.5716
license: cc-by-sa-4.0
language:
- nl
pipeline_tag: text2text-generation
---
# flan-rebel-nl
This model is a fine-tuned version of flan-t5-base on the rebel-short dataset.
It achieves the following results on the evaluation set:
- Loss: 0.1029
- Rouge1: 51.5716
- Rouge2: 40.2152
- Rougel: 49.9941
- Rougelsum: 49.9767
- Gen Len: 18.5898
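For readers unfamiliar with the metric, the sketch below shows a simplified ROUGE-1 F1 computation based on clipped unigram overlap. It is an illustration only: it uses plain whitespace tokenization and returns a value between 0 and 1, whereas the scores above were produced by the Trainer's ROUGE implementation and are presumably scaled to 0-100. The function name `rouge1_f1` and the example strings are hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Hand-written strings for illustration, not taken from the dataset.
score = rouge1_f1("Nederland Europa continent",
                  "Nederland Koninkrijk der Nederlanden continent")
print(round(100 * score, 4))
```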
## Model description
This is a flan-t5-base model fine-tuned on a Dutch version of the dataset used for REBEL (Relation Extraction By End-to-end Language generation). The model extracts triplets of the form {head, relation, tail} from unstructured text. The Dutch triplets and the accompanying text were generated with the code released by the original REBEL authors, available at https://github.com/Babelscape/crocodile.
## Pipeline usage
The code below is adapted from the original REBEL model: https://huggingface.co/Babelscape/rebel-large .
```python
from transformers import pipeline

triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')

# Generate the linearized triplets with beam search.
# The pipeline returns a list with one dict per input, keyed by "generated_text".
extracted_text = triplet_extractor("Nederland is een van de landen binnen het Koninkrijk der Nederlanden. Nederland ligt voor het overgrote deel in het noordwesten van Europa, aan de Noordzee. ", max_length = 512, num_beams = 3, temperature = 1)

# Function to parse the generated text and extract the triplets
def extract_triplets(text):
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'  # 't' = reading head, 's' = reading tail, 'o' = reading relation
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    # Flush the last triplet, if it is complete.
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets

extracted_triplets = extract_triplets(extracted_text[0]["generated_text"])
print(extracted_triplets)
```
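To make the target format concrete, the sketch below runs a condensed version of the parsing logic on a hand-written string. The string is an assumption about what the linearized output looks like (`<triplet> head <subj> tail <obj> relation`, following the REBEL convention); it is not actual model output, the relation labels are made up, and `parse_linearized` is a hypothetical name.

```python
def parse_linearized(text):
    """Condensed, illustrative parser for '<triplet> head <subj> tail <obj> relation'."""
    triplets = []
    head, tail, relation = [], [], []
    target = None  # the list that currently receives plain tokens

    def flush():
        # Emit a triplet only when head, tail and relation are all present.
        if head and tail and relation:
            triplets.append({'head': ' '.join(head),
                             'type': ' '.join(relation),
                             'tail': ' '.join(tail)})

    for token in text.split():
        if token == '<triplet>':
            flush()
            head, tail, relation = [], [], []
            target = head
        elif token == '<subj>':
            flush()
            tail, relation = [], []
            target = tail
        elif token == '<obj>':
            relation = []
            target = relation
        elif target is not None:
            target.append(token)
    flush()
    return triplets

# Hand-written example string (illustrative, not produced by the model).
example = ("<triplet> Nederland <subj> Koninkrijk der Nederlanden <obj> onderdeel van "
           "<subj> Europa <obj> continent")
for triplet in parse_linearized(example):
    print(triplet)
```

Both triplets in this example share the head `Nederland`, because a second `<subj>` without a new `<triplet>` marker starts another tail and relation for the same head.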
A trick that may improve results is to constrain generation to the entities found in the input: extract the entities with an NER pipeline and force their tokens to appear in the generated output.
```python
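from transformers import pipeline

triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')
ner_extractor = pipeline("ner", "Babelscape/wikineural-multilingual-ner", aggregation_strategy = "simple")

# Example input; replace with your own Dutch text.
input_text = "Nederland is een van de landen binnen het Koninkrijk der Nederlanden."

# Extract the entities
ner_output = ner_extractor(input_text)
ents = [i["word"] for i in ner_output]

if len(ents) > 0:
    tokens = triplet_extractor.tokenizer(ents, add_special_tokens=False)["input_ids"]
    # Constrained generation requires beam search, so num_beams must be > 1.
    extracted_text = triplet_extractor(input_text, max_length = 512, num_beams = 3, force_words_ids = tokens)
else:
    extracted_text = triplet_extractor(input_text, max_length = 512, temperature = 1)

# extract_triplets is the parsing function defined in the previous snippet.
triplets = extract_triplets(extracted_text[0]["generated_text"])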
```
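An NER pipeline can return the same entity more than once, for example when a name is repeated in the text. A possible pre-processing step, sketched below, is to deduplicate the entity strings before tokenizing them as forced words. This is an optional suggestion rather than part of the original recipe; `unique_entities` is a hypothetical helper, and `sample_ner_output` is hand-written to mimic the list of dicts with a `word` key that the snippet above reads.

```python
def unique_entities(ner_output):
    """Return entity strings in first-seen order, skipping duplicates and empty strings."""
    seen = set()
    ents = []
    for item in ner_output:
        word = item["word"].strip()
        if word and word not in seen:
            seen.add(word)
            ents.append(word)
    return ents

# Hand-written stand-in for the NER pipeline output (illustrative only).
sample_ner_output = [
    {"entity_group": "LOC", "word": "Nederland"},
    {"entity_group": "LOC", "word": "Koninkrijk der Nederlanden"},
    {"entity_group": "LOC", "word": "Nederland"},
]
print(unique_entities(sample_ner_output))
```

In the snippet above, `ents = unique_entities(ner_output)` could then replace the list comprehension.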
## Training and evaluation data
The data used for training and evaluating this model was generated with https://github.com/Babelscape/crocodile .
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
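
As a rough illustration of `lr_scheduler_type: linear`, the sketch below computes the learning rate at a given optimizer step. It assumes zero warmup steps (this card does not state a warmup setting) and uses the 110235 total steps reported in the training results table. `linear_lr` is a hypothetical helper written for this example, not a Transformers function.

```python
BASE_LR = 5e-05        # learning_rate from the list above
TOTAL_STEPS = 110235   # final step in the training results table

def linear_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS, warmup_steps=0):
    """Optional linear warmup, followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Learning rate at the start, after the first epoch, and at the end of training.
for step in (0, 22047, 110235):
    print(step, linear_lr(step))
```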
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|:-------------:|:-----:|:------:|:---------------:|:-------:|:-------:|:-------:|:---------:|:-------:|
| 0.1256 | 1.0 | 22047 | 0.1206 | 50.3892 | 38.2761 | 48.7657 | 48.7444 | 18.6112 |
| 0.1091 | 2.0 | 44094 | 0.1112 | 50.9615 | 39.2843 | 49.3865 | 49.3674 | 18.5447 |
| 0.0875 | 3.0 | 66141 | 0.1047 | 51.2045 | 39.7598 | 49.6483 | 49.6317 | 18.5763 |
| 0.0841 | 4.0 | 88188 | 0.1036 | 51.3543 | 39.9776 | 49.8528 | 49.8223 | 18.6178 |
| 0.0806 | 5.0 | 110235 | 0.1029 | 51.5716 | 40.2152 | 49.9941 | 49.9767 | 18.5898 |
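
The step counts also allow a rough estimate of the training-set size. The calculation below assumes a single device, no gradient accumulation, and that the last partial batch is kept; none of these are stated in this card.

```python
steps_per_epoch = 22047   # step count at epoch 1.0 in the table above
train_batch_size = 4      # from the training hyperparameters

# If the number of examples N is split into batches of 4 and the last partial
# batch is kept, then ceil(N / 4) == steps_per_epoch, which bounds N:
max_examples = steps_per_epoch * train_batch_size
min_examples = (steps_per_epoch - 1) * train_batch_size + 1

print(f"Estimated training examples: {min_examples} to {max_examples}")
```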
### Framework versions
- Transformers 4.27.2
- Pytorch 1.13.1+cu117
- Datasets 2.10.1
- Tokenizers 0.12.1