---
datasets:
- rebel-short
metrics:
- rouge
model-index:
- name: flan-t5-base
  results:
  - task:
      name: Sequence-to-sequence Language Modeling
      type: text2text-generation
    dataset:
      name: rebel-short
      type: rebel-short
      config: default
      split: test
      args: default
    metrics:
    - name: Rouge1
      type: rouge
      value: 51.5716
license: cc-by-sa-4.0
language:
- nl
pipeline_tag: text2text-generation
---


# flan-rebel-nl

This model is a fine-tuned version of flan-t5-base on the rebel-short dataset.
It achieves the following results on the evaluation set:
- Loss: 0.1029
- Rouge1: 51.5716
- Rouge2: 40.2152
- Rougel: 49.9941
- Rougelsum: 49.9767
- Gen Len: 18.5898

## Model description

This is a flan-t5-base model fine-tuned on a Dutch version of the dataset from REBEL (Relation Extraction By End-to-end Language generation). The model extracts triplets of the form {head, relation, tail} from unstructured text. The Dutch triplets and text were generated with the code published by the original REBEL authors, available at https://github.com/Babelscape/crocodile.


## Pipeline usage

The code below is adapted from the original REBEL model: https://huggingface.co/Babelscape/rebel-large.

```python
from transformers import pipeline

triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')
# The pipeline returns a list of dicts; the generated string is stored under "generated_text".
extracted_text = triplet_extractor("Nederland is een van de landen binnen het Koninkrijk der Nederlanden. Nederland ligt voor het overgrote deel in het noordwesten van Europa, aan de Noordzee. ", max_length = 512, num_beams = 3, temperature = 1)
# Function to parse the generated text and extract the triplets
def extract_triplets(text):
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets
extracted_triplets = extract_triplets(extracted_text[0]["generated_text"])
print(extracted_triplets)
```
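The parser above expects REBEL-style linearized output of the form `<triplet> head <subj> tail <obj> relation`, where one head can be followed by several `<subj> ... <obj> ...` pairs. To check the parsing logic without downloading the model, the same state machine can be run on a hand-written string. This is a minimal sketch: `parse_linearized` is a hypothetical compact rewrite of the parser (it only emits a triplet when all three fields are filled), and the sample string and its relation labels are made up for illustration, not actual model output.

```python
def parse_linearized(text):
    """Parse '<triplet> head <subj> tail <obj> relation' sequences into dicts."""
    triplets = []
    head, tail, relation = [], [], []
    current = None  # list that receives the next plain token

    def flush():
        # Emit a triplet only when head, tail and relation are all present.
        if head and tail and relation:
            triplets.append({
                "head": " ".join(head),
                "type": " ".join(relation),
                "tail": " ".join(tail),
            })

    for token in text.split():
        if token == "<triplet>":
            flush()
            head, tail, relation = [], [], []
            current = head
        elif token == "<subj>":
            # A further tail for the same head: emit the previous pair first.
            flush()
            tail, relation = [], []
            current = tail
        elif token == "<obj>":
            relation = []
            current = relation
        elif current is not None:
            current.append(token)
    flush()
    return triplets


# Hand-written sample (illustrative only, not produced by the model)
sample = ("<triplet> Nederland <subj> Koninkrijk der Nederlanden <obj> onderdeel van "
          "<subj> Europa <obj> continent")
print(parse_linearized(sample))
```

Keeping the parser separate from generation makes it possible to unit-test the post-processing on fixed strings before running the model.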

Results may improve if you constrain generation to known entities: extract entities with an NER pipeline, then force their tokens to appear in the generated output.

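```python
from transformers import pipeline

triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')
ner_extractor = pipeline("ner", "Babelscape/wikineural-multilingual-ner", aggregation_strategy="simple")

input_text = "Nederland is een van de landen binnen het Koninkrijk der Nederlanden."

# Extract entity strings from the input text
ner_output = ner_extractor(input_text)
ents = [i["word"] for i in ner_output]

if len(ents) > 0:
    # Token ids of each entity, used as generation constraints
    tokens = triplet_extractor.tokenizer(ents, add_special_tokens=False)["input_ids"]
    # force_words_ids relies on constrained beam search, which requires num_beams > 1
    extracted_text = triplet_extractor(input_text, max_length=512, num_beams=3, force_words_ids=tokens)
else:
    extracted_text = triplet_extractor(input_text, max_length=512, temperature=1)

# extract_triplets is the function defined in the previous code block
triplets = extract_triplets(extracted_text[0]["generated_text"])
```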

## Training and evaluation data

The data used to develop and evaluate this model was generated with https://github.com/Babelscape/crocodile.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
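
To make the `linear` scheduler setting concrete, the sketch below computes the learning rate at a given optimizer step. It assumes zero warmup steps (the card does not state a warmup value) and takes the total of 110235 steps from the training results table; `linear_lr` is a hypothetical helper written for illustration, not a function from the training code.

```python
def linear_lr(step, base_lr=5e-5, total_steps=110235, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start, after epochs 1 and 3, and at the end of training
for step in (0, 22047, 66141, 110235):
    print(step, f"{linear_lr(step):.2e}")
```

Under these assumptions the rate falls by one fifth of its initial value per epoch. With 22047 steps per epoch and a batch size of 4, the training set would hold roughly 88,000 examples if a single device and no gradient accumulation were used; neither is stated in the card.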

### Training results

| Training Loss | Epoch | Step   | Validation Loss | Rouge1  | Rouge2  | Rougel  | Rougelsum | Gen Len |
|:-------------:|:-----:|:------:|:---------------:|:-------:|:-------:|:-------:|:---------:|:-------:|
| 0.1256        | 1.0   | 22047  | 0.1206          | 50.3892 | 38.2761 | 48.7657 | 48.7444   | 18.6112 |
| 0.1091        | 2.0   | 44094  | 0.1112          | 50.9615 | 39.2843 | 49.3865 | 49.3674   | 18.5447 |
| 0.0875        | 3.0   | 66141  | 0.1047          | 51.2045 | 39.7598 | 49.6483 | 49.6317   | 18.5763 |
| 0.0841        | 4.0   | 88188  | 0.1036          | 51.3543 | 39.9776 | 49.8528 | 49.8223   | 18.6178 |
| 0.0806        | 5.0   | 110235 | 0.1029          | 51.5716 | 40.2152 | 49.9941 | 49.9767   | 18.5898 |
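
For readers unfamiliar with the metric, the Rouge1 column measures unigram overlap between the generated sequence and the reference sequence, reported on a 0 to 100 scale. The sketch below is a simplified illustration that assumes lowercased whitespace tokenization and no stemming, so it will not reproduce the reported scores exactly; `rouge1_f1` is a hypothetical helper and the example strings are made up.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Made-up linearized sequences, scaled to 0-100 like the table above
reference = "<triplet> Nederland <subj> Europa <obj> continent"
prediction = "<triplet> Nederland <subj> Noordzee <obj> continent"
print(round(100 * rouge1_f1(prediction, reference), 2))
```

Because the score counts token overlap, a prediction with a wrong tail can still score high; ROUGE is only a rough proxy for triplet-level correctness.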


### Framework versions

- Transformers 4.27.2
- Pytorch 1.13.1+cu117
- Datasets 2.10.1
- Tokenizers 0.12.1