---
datasets:
- rebel-short
metrics:
- rouge
model-index:
- name: flan-t5-base
  results:
  - task:
      name: Sequence-to-sequence Language Modeling
      type: text2text-generation
    dataset:
      name: rebel-short
      type: rebel-short
      config: default
      split: test
      args: default
    metrics:
    - name: Rouge1
      type: rouge
      value: 51.5716
license: cc-by-sa-4.0
language:
- nl
pipeline_tag: text2text-generation
library_name: transformers
---
# flan-rebel-nl
This model is a fine-tuned version of [google/flan-t5-base](https://huggingface.co/google/flan-t5-base) on the rebel-short dataset.
It achieves the following results on the evaluation set:
- Loss: 0.1029
- Rouge1: 51.5716
- Rouge2: 40.2152
- Rougel: 49.9941
- Rougelsum: 49.9767
- Gen Len: 18.5898
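
For reference, below is a minimal sketch of how ROUGE scores in this format can be computed with the `evaluate` library. This is an assumption about the evaluation setup, not the card's original evaluation script, and the prediction/reference strings are purely illustrative.

```python
# Minimal sketch (assumed, not the original evaluation script): computing ROUGE
# over generated vs. reference linearisations with the `evaluate` library.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["<triplet> Nederland <subj> Europa <obj> werelddeel"]  # illustrative model output
references = ["<triplet> Nederland <subj> Europa <obj> werelddeel"]  # illustrative gold linearisation
scores = rouge.compute(predictions=predictions, references=references, use_stemmer=True)
print({name: round(value * 100, 4) for name, value in scores.items()})  # reported above as percentages
```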
## Model description
This is a flan-t5-base model fine-tuned on a Dutch dataset built following REBEL (Relation Extraction By End-to-end Language generation). The model extracts triplets of the form {head, relation, tail} from unstructured text. The Dutch triplets and accompanying unstructured text were generated with the code of the original REBEL authors, available at https://github.com/Babelscape/crocodile.
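
The model emits a REBEL-style linearisation in which special markers separate the head, tail, and relation. The example below is purely illustrative (the sentence and relation label are not taken from the training data); it only shows the string format that the parsing helper in the next section turns into triplet dictionaries.

```python
# Illustrative only: a REBEL-style linearised output of the form
# "<triplet> head <subj> tail <obj> relation".
generated = "<triplet> Nederland <subj> Koninkrijk der Nederlanden <obj> land"

# Parsed with the extract_triplets helper shown below, this would become:
# [{'head': 'Nederland', 'type': 'land', 'tail': 'Koninkrijk der Nederlanden'}]
```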
## Pipeline usage
The code below is adapted from the original REBEL model card: https://huggingface.co/Babelscape/rebel-large .
```python
from transformers import pipeline
triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')

# We need to use the tokenizer manually since we need the special tokens
# (<triplet>, <subj>, <obj>), which the default pipeline decoding strips.
extracted = triplet_extractor(
    "Nederland is een van de landen binnen het Koninkrijk der Nederlanden. Nederland ligt voor het overgrote deel in het noordwesten van Europa, aan de Noordzee. ",
    max_length=512,
    num_beams=3,
    return_tensors=True,
    return_text=False,
)
extracted_text = triplet_extractor.tokenizer.batch_decode([extracted[0]["generated_token_ids"]])
# Function to parse the generated text and extract the triplets
def extract_triplets(text):
    triplets = []
    relation, subject, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            # Accumulate tokens into the field indicated by the last marker seen.
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
    return triplets
extracted_triplets = extract_triplets(extracted_text[0])
print(extracted_triplets)
```
A trick that may improve results is to constrain generation: extract entities with a NER pipeline first and force those entity tokens to appear in the generated output.
```python
# Reuses the extract_triplets helper defined above.
triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')
ner_extractor = pipeline("ner", model="Babelscape/wikineural-multilingual-ner", aggregation_strategy="simple")

input_text = "Nederland is een van de landen binnen het Koninkrijk der Nederlanden."

# Extract entities
ner_output = ner_extractor(input_text)
ents = [i["word"] for i in ner_output]

if len(ents) > 0:
    tokens = triplet_extractor.tokenizer(ents, add_special_tokens=False)["input_ids"]
    # force_words_ids uses constrained beam search, which requires num_beams > 1.
    extracted = triplet_extractor(input_text, max_length=512, num_beams=3, force_words_ids=tokens,
                                  return_tensors=True, return_text=False)
else:
    extracted = triplet_extractor(input_text, max_length=512, num_beams=3,
                                  return_tensors=True, return_text=False)

# Decode manually again so the special tokens are kept for parsing.
extracted_text = triplet_extractor.tokenizer.batch_decode([extracted[0]["generated_token_ids"]])
triplets = extract_triplets(extracted_text[0])
```
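Note that `force_words_ids` switches generation to constrained beam search, which is slower than unconstrained decoding, so it is worth comparing both variants on your own data before settling on the constrained version.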
## Training and evaluation data
The data used for training and evaluating this model was generated with https://github.com/Babelscape/crocodile .
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
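
As a reference, here is a minimal sketch of how these hyperparameters map onto `Seq2SeqTrainingArguments`. The argument names are the standard `transformers` ones; the output directory and any options not listed above are assumptions, not the card's original training script.

```python
# Minimal sketch (assumed): the hyperparameters above expressed as
# Seq2SeqTrainingArguments. "flan-rebel-nl" as output_dir is illustrative.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-rebel-nl",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=5,
    # Adam with betas=(0.9, 0.999) and epsilon=1e-08 is the Trainer default optimizer.
    evaluation_strategy="epoch",   # assumption: the results table reports per-epoch validation metrics
    predict_with_generate=True,    # assumption: needed to compute ROUGE on generated sequences
)
```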
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|:-------------:|:-----:|:------:|:---------------:|:-------:|:-------:|:-------:|:---------:|:-------:|
| 0.1256 | 1.0 | 22047 | 0.1206 | 50.3892 | 38.2761 | 48.7657 | 48.7444 | 18.6112 |
| 0.1091 | 2.0 | 44094 | 0.1112 | 50.9615 | 39.2843 | 49.3865 | 49.3674 | 18.5447 |
| 0.0875 | 3.0 | 66141 | 0.1047 | 51.2045 | 39.7598 | 49.6483 | 49.6317 | 18.5763 |
| 0.0841 | 4.0 | 88188 | 0.1036 | 51.3543 | 39.9776 | 49.8528 | 49.8223 | 18.6178 |
| 0.0806 | 5.0 | 110235 | 0.1029 | 51.5716 | 40.2152 | 49.9941 | 49.9767 | 18.5898 |
### Framework versions
- Transformers 4.27.2
- Pytorch 1.13.1+cu117
- Datasets 2.10.1
- Tokenizers 0.12.1