Update README.md
README.md CHANGED
@@ -1,12 +1,10 @@
 ---
-tags:
-- generated_from_trainer
 datasets:
 - rebel-short
 metrics:
 - rouge
 model-index:
 - name: flan-t5-base
   results:
   - task:
       name: Sequence-to-sequence Language Modeling
@@ -21,6 +19,10 @@ model-index:
     - name: Rouge1
       type: rouge
       value: 51.5716
+license: cc-by-sa-4.0
+language:
+- nl
+pipeline_tag: text2text-generation
 ---
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -28,7 +30,7 @@ should probably proofread and complete it, then remove this comment. -->
 
 # flan-t5-base-samsum
 
-This model is a fine-tuned version of
+This model is a fine-tuned version of flan-t5-base on the rebel-short dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.1029
 - Rouge1: 51.5716
@@ -39,15 +41,80 @@
 
 ## Model description
 
-
-
-
-
-
+This is a flan-t5-base model fine-tuned on a Dutch version of the dataset from REBEL (Relation Extraction By End-to-end Language generation). The model extracts triplets of the form {head, relation, tail} from unstructured text. The Dutch triplets and the accompanying unstructured text were generated with the code of the original REBEL authors, available at https://github.com/Babelscape/crocodile.
+
+
+
## Pipeline usage
|
48 |
+
|
49 |
+
The code below is adopted from the original REBEL model: https://huggingface.co/Babelscape/rebel-large .
|
50 |
+
|
51 |
+
```python
|
+from transformers import pipeline
+
+# Parse the generated text and extract the triplets.
+# The model linearizes each triplet as: <triplet> head <subj> tail <obj> relation
+def extract_triplets(text):
+    triplets = []
+    relation, subject, object_ = '', '', ''
+    text = text.strip()
+    current = 'x'
+    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
+        if token == "<triplet>":
+            # A new triplet starts; store the previous one if it is complete.
+            current = 't'
+            if relation != '':
+                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
+                relation = ''
+            subject = ''
+        elif token == "<subj>":
+            # The tail entity follows; store a pending triplet if it is complete.
+            current = 's'
+            if relation != '':
+                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
+            object_ = ''
+        elif token == "<obj>":
+            # The relation label follows.
+            current = 'o'
+            relation = ''
+        else:
+            if current == 't':
+                subject += ' ' + token
+            elif current == 's':
+                object_ += ' ' + token
+            elif current == 'o':
+                relation += ' ' + token
+    if subject != '' and relation != '' and object_ != '':
+        triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
+    return triplets
+
+triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')
+# Decode with the tokenizer manually so the special tokens marking the triplets are kept.
+generated = triplet_extractor("Nederland is een van de landen binnen het Koninkrijk der Nederlanden. Nederland ligt voor het overgrote deel in het noordwesten van Europa, aan de Noordzee.",
+                              max_length=512, num_beams=3, return_tensors=True, return_text=False)
+extracted_text = triplet_extractor.tokenizer.batch_decode([generated[0]["generated_token_ids"]])
+extracted_triplets = extract_triplets(extracted_text[0])
+print(extracted_triplets)
+```
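+
+For illustration, here is how `extract_triplets` parses a linearized sequence. The input string below is hypothetical, not actual model output:
+
+```python
+# Hypothetical linearized sequence in the REBEL format (not real model output).
+example = "<triplet> Nederland <subj> Europa <obj> werelddeel"
+print(extract_triplets(example))
+# [{'head': 'Nederland', 'type': 'werelddeel', 'tail': 'Europa'}]
+```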
+
+A trick that may give better results is to constrain generation: first extract entities with a NER pipeline, then force those tokens to appear in the generated output.
+
+```python
+
+# Reuses extract_triplets() from the previous snippet.
+triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')
+ner_extractor = pipeline("ner", model="Babelscape/wikineural-multilingual-ner", aggregation_strategy="simple")
+
+input_text = "Nederland is een van de landen binnen het Koninkrijk der Nederlanden."
+
+# Extract entities so they can be forced into the generated output.
+ner_output = ner_extractor(input_text)
+ents = [ent["word"] for ent in ner_output]
+
+if len(ents) > 0:
+    tokens = triplet_extractor.tokenizer(ents, add_special_tokens=False)["input_ids"]
+    # Constrained generation with force_words_ids requires beam search (num_beams > 1).
+    generated = triplet_extractor(input_text, max_length=512, num_beams=3,
+                                  force_words_ids=tokens, return_tensors=True, return_text=False)
+else:
+    generated = triplet_extractor(input_text, max_length=512, num_beams=3,
+                                  return_tensors=True, return_text=False)
+
+extracted_text = triplet_extractor.tokenizer.batch_decode([generated[0]["generated_token_ids"]])
+triplets = extract_triplets(extracted_text[0])
+print(triplets)
+```
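+
+Note that `force_words_ids` triggers constrained beam search, which is why the snippet sets `num_beams=3`; transformers rejects forced words when `num_beams` is 1.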
 
 ## Training and evaluation data
 
-
+The data used to develop and evaluate this model was generated with https://github.com/Babelscape/crocodile.
 
 ## Training procedure
 
@@ -78,4 +145,4 @@ The following hyperparameters were used during training:
 - Transformers 4.27.2
 - Pytorch 1.13.1+cu117
 - Datasets 2.10.1
-- Tokenizers 0.12.1
+- Tokenizers 0.12.1