Update README.md
README.md CHANGED
@@ -1,12 +1,10 @@
 ---
-tags:
-- generated_from_trainer
 datasets:
 - rebel-short
 metrics:
 - rouge
 model-index:
 - name: flan-t5-base
   results:
   - task:
       name: Sequence-to-sequence Language Modeling
@@ -21,6 +19,10 @@ model-index:
     - name: Rouge1
       type: rouge
       value: 51.5716
+license: cc-by-sa-4.0
+language:
+- nl
+pipeline_tag: text2text-generation
 ---
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -28,7 +30,7 @@ should probably proofread and complete it, then remove this comment. -->
 
 # flan-t5-base-samsum
 
-This model is a fine-tuned version of
+This model is a fine-tuned version of flan-t5-base on the rebel-short dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.1029
 - Rouge1: 51.5716
@@ -39,15 +41,80 @@
 
 ## Model description
 
-
-
-
-
-
+This is a flan-t5-base model fine-tuned on a Dutch version of the dataset from REBEL (Relation Extraction By End-to-end Language generation). The model extracts triplets of the form {head, relation, tail} from unstructured text. The Dutch triplets and the accompanying unstructured text were generated with the code of the original REBEL authors, available at https://github.com/Babelscape/crocodile.
+
+
+
## Pipeline usage
|
48 |
+
|
49 |
+
The code below is adopted from the original REBEL model: https://huggingface.co/Babelscape/rebel-large .
|
50 |
+
|
51 |
+
```python
|
+from transformers import pipeline
+
+# Parse the generated text and extract the triplets.
+# The model linearizes each triplet as: <triplet> head <subj> tail <obj> relation
+def extract_triplets(text):
+    triplets = []
+    relation, subject, object_ = '', '', ''
+    text = text.strip()
+    current = 'x'
+    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
+        if token == "<triplet>":
+            # A new triplet starts; store the previous one if it is complete.
+            current = 't'
+            if relation != '':
+                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
+                relation = ''
+            subject = ''
+        elif token == "<subj>":
+            # The tail entity follows; store a pending triplet if it is complete.
+            current = 's'
+            if relation != '':
+                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
+            object_ = ''
+        elif token == "<obj>":
+            # The relation label follows.
+            current = 'o'
+            relation = ''
+        else:
+            if current == 't':
+                subject += ' ' + token
+            elif current == 's':
+                object_ += ' ' + token
+            elif current == 'o':
+                relation += ' ' + token
+    if subject != '' and relation != '' and object_ != '':
+        triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
+    return triplets
+
+triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')
+# Decode with the tokenizer manually so the special tokens marking the triplets are kept.
+generated = triplet_extractor("Nederland is een van de landen binnen het Koninkrijk der Nederlanden. Nederland ligt voor het overgrote deel in het noordwesten van Europa, aan de Noordzee.",
+                              max_length=512, num_beams=3, return_tensors=True, return_text=False)
+extracted_text = triplet_extractor.tokenizer.batch_decode([generated[0]["generated_token_ids"]])
+extracted_triplets = extract_triplets(extracted_text[0])
+print(extracted_triplets)
+```
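+
+For illustration, here is how `extract_triplets` parses a linearized sequence. The input string below is hypothetical, not actual model output:
+
+```python
+# Hypothetical linearized sequence in the REBEL format (not real model output).
+example = "<triplet> Nederland <subj> Europa <obj> werelddeel"
+print(extract_triplets(example))
+# [{'head': 'Nederland', 'type': 'werelddeel', 'tail': 'Europa'}]
+```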
+
+A trick that may give better results is to constrain generation: first extract entities with a NER pipeline, then force those tokens to appear in the generated output.
+
+```python
+
+# Reuses extract_triplets() from the previous snippet.
+triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')
+ner_extractor = pipeline("ner", model="Babelscape/wikineural-multilingual-ner", aggregation_strategy="simple")
+
+input_text = "Nederland is een van de landen binnen het Koninkrijk der Nederlanden."
+
+# Extract entities so they can be forced into the generated output.
+ner_output = ner_extractor(input_text)
+ents = [ent["word"] for ent in ner_output]
+
+if len(ents) > 0:
+    tokens = triplet_extractor.tokenizer(ents, add_special_tokens=False)["input_ids"]
+    # Constrained generation with force_words_ids requires beam search (num_beams > 1).
+    generated = triplet_extractor(input_text, max_length=512, num_beams=3,
+                                  force_words_ids=tokens, return_tensors=True, return_text=False)
+else:
+    generated = triplet_extractor(input_text, max_length=512, num_beams=3,
+                                  return_tensors=True, return_text=False)
+
+extracted_text = triplet_extractor.tokenizer.batch_decode([generated[0]["generated_token_ids"]])
+triplets = extract_triplets(extracted_text[0])
+print(triplets)
+```
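+
+Note that `force_words_ids` triggers constrained beam search, which is why the snippet sets `num_beams=3`; transformers rejects forced words when `num_beams` is 1.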
 
 ## Training and evaluation data
 
-
+The data used to develop and evaluate this model was generated with https://github.com/Babelscape/crocodile.
 
 ## Training procedure
 
@@ -78,4 +145,4 @@ The following hyperparameters were used during training:
 - Transformers 4.27.2
 - Pytorch 1.13.1+cu117
 - Datasets 2.10.1
-- Tokenizers 0.12.1
+- Tokenizers 0.12.1