---
datasets:
- rebel-short
metrics:
- rouge
model-index:
- name: flan-t5-base
  results:
  - task:
      name: Sequence-to-sequence Language Modeling
      type: text2text-generation
    dataset:
      name: rebel-short
      type: rebel-short
      config: default
      split: test
      args: default
    metrics:
    - name: Rouge1
      type: rouge
      value: 51.5716
license: cc-by-sa-4.0
language:
- nl
pipeline_tag: text2text-generation
---

# flan-rebel-nl

This model is a fine-tuned version of flan-t5-base on the rebel-short dataset. It achieves the following results on the evaluation set:
- Loss: 0.1029
- Rouge1: 51.5716
- Rouge2: 40.2152
- Rougel: 49.9941
- Rougelsum: 49.9767
- Gen Len: 18.5898

## Model description

This is a flan-t5-base model fine-tuned on a Dutch dataset built in the style of REBEL: Relation Extraction By End-to-end Language generation. The model extracts triplets of the form {head, relation, tail} from unstructured text. The Dutch triplets and unstructured text were generated with the code of the original REBEL authors, available at https://github.com/Babelscape/crocodile.

## Pipeline usage

The code below is adapted from the original REBEL model: https://huggingface.co/Babelscape/rebel-large.

```python
from transformers import pipeline

triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')

# We need to decode with the tokenizer manually, because the pipeline would strip
# the special tokens (<triplet>, <subj>, <obj>) that mark the triplet boundaries.
input_text = ("Nederland is een van de landen binnen het Koninkrijk der Nederlanden. "
              "Nederland ligt voor het overgrote deel in het noordwesten van Europa, aan de Noordzee.")
generated = triplet_extractor(input_text, max_length=512, num_beams=3,
                              return_tensors=True, return_text=False)
extracted_text = triplet_extractor.tokenizer.batch_decode([generated[0]["generated_token_ids"]])

# Function to parse the generated text and extract the triplets
def extract_triplets(text):
    triplets = []
    relation, subject, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
    return triplets

extracted_triplets = extract_triplets(extracted_text[0])
print(extracted_triplets)
```

A trick that may give better results is to force the entities into the generated output: extract entities with a NER pipeline and pass their token ids as forced words during generation.
```python
triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')
ner_extractor = pipeline("ner", model="Babelscape/wikineural-multilingual-ner", aggregation_strategy="simple")

# Extract entities
ner_output = ner_extractor(input_text)
ents = [i["word"] for i in ner_output]

if len(ents) > 0:
    tokens = triplet_extractor.tokenizer(ents, add_special_tokens=False)["input_ids"]
    extracted_text = triplet_extractor(input_text, max_length=512, force_words_ids=tokens)
else:
    extracted_text = triplet_extractor(input_text, max_length=512)

triplets = extract_triplets(extracted_text[0]["generated_text"])
```

## Training and evaluation data

The data used to develop and evaluate this model was generated with https://github.com/Babelscape/crocodile.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5

### Training results

| Training Loss | Epoch | Step   | Validation Loss | Rouge1  | Rouge2  | Rougel  | Rougelsum | Gen Len |
|:-------------:|:-----:|:------:|:---------------:|:-------:|:-------:|:-------:|:---------:|:-------:|
| 0.1256        | 1.0   | 22047  | 0.1206          | 50.3892 | 38.2761 | 48.7657 | 48.7444   | 18.6112 |
| 0.1091        | 2.0   | 44094  | 0.1112          | 50.9615 | 39.2843 | 49.3865 | 49.3674   | 18.5447 |
| 0.0875        | 3.0   | 66141  | 0.1047          | 51.2045 | 39.7598 | 49.6483 | 49.6317   | 18.5763 |
| 0.0841        | 4.0   | 88188  | 0.1036          | 51.3543 | 39.9776 | 49.8528 | 49.8223   | 18.6178 |
| 0.0806        | 5.0   | 110235 | 0.1029          | 51.5716 | 40.2152 | 49.9941 | 49.9767   | 18.5898 |

### Framework versions

- Transformers 4.27.2
- Pytorch 1.13.1+cu117
- Datasets 2.10.1
- Tokenizers 0.12.1
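### Output format

To illustrate the linearized format that the `extract_triplets` parser in the pipeline example expects (`<triplet> head <subj> tail <obj> relation`, repeated once per triplet), here is a minimal standalone sketch. The sample string is hand-written for illustration and is not actual output of this model.

```python
def extract_triplets(text):
    """Parse a linearized REBEL-style output string into triplet dicts."""
    triplets = []
    relation, subject, object_ = '', '', ''
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":  # a new triplet starts: flush any completed one
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":  # the tail entity follows
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":  # the relation label follows
            current = 'o'
            relation = ''
        else:  # accumulate plain tokens into the current field
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
    return triplets

# Hypothetical linearized output for the example sentence about Nederland:
sample = "<s><triplet> Nederland <subj> Europa <obj> werelddeel <triplet> Nederland <subj> Noordzee <obj> grenst aan</s>"
print(extract_triplets(sample))
# → [{'head': 'Nederland', 'type': 'werelddeel', 'tail': 'Europa'},
#    {'head': 'Nederland', 'type': 'grenst aan', 'tail': 'Noordzee'}]
```

Note that the head is accumulated after `<triplet>`, the tail after `<subj>`, and the relation after `<obj>`; a triplet is only emitted once its relation is complete, which is why the last triplet is flushed after the loop.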