Kbrek committed
Commit 813b13c
Parent: 66022a9

Update README.md

Files changed (1): README.md +78 -11
README.md CHANGED
@@ -1,12 +1,10 @@
 ---
- tags:
- - generated_from_trainer
 datasets:
 - rebel-short
 metrics:
 - rouge
 model-index:
- - name: flan-t5-base-samsum
+ - name: flan-t5-base
   results:
   - task:
       name: Sequence-to-sequence Language Modeling
@@ -21,6 +19,10 @@ model-index:
       - name: Rouge1
         type: rouge
         value: 51.5716
+ license: cc-by-sa-4.0
+ language:
+ - nl
+ pipeline_tag: text2text-generation
 ---
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -28,7 +30,7 @@ should probably proofread and complete it, then remove this comment. -->
 
 # flan-t5-base-samsum
 
- This model is a fine-tuned version of [flan-t5-base-samsum\checkpoint-1658](https://huggingface.co/flan-t5-base-samsum\checkpoint-1658) on the rebel-short dataset.
+ This model is a fine-tuned version of flan-t5-base on the rebel-short dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.1029
 - Rouge1: 51.5716
@@ -39,15 +41,80 @@ It achieves the following results on the evaluation set:
 
 ## Model description
 
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
+ This is a flan-t5-base model fine-tuned on a Dutch dataset built with REBEL: Relation Extraction By End-to-end Language generation. The model extracts triplets of the form {head, relation, tail} from unstructured text. The Dutch triplets and unstructured text were generated with the code of the original REBEL authors, available at https://github.com/Babelscape/crocodile.
+
+ ## Pipeline usage
+
+ The code below is adapted from the original REBEL model: https://huggingface.co/Babelscape/rebel-large .
+
+ ```python
+ from transformers import pipeline
+
+ triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')
+ extracted_text = triplet_extractor("Nederland is een van de landen binnen het Koninkrijk der Nederlanden. Nederland ligt voor het overgrote deel in het noordwesten van Europa, aan de Noordzee.", max_length=512, num_beams=3)
+
+ # Function to parse the generated text and extract the triplets
+ def extract_triplets(text):
+     triplets = []
+     relation, subject, object_ = '', '', ''
+     text = text.strip()
+     current = 'x'
+     for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
+         if token == "<triplet>":
+             current = 't'
+             if relation != '':
+                 triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
+                 relation = ''
+             subject = ''
+         elif token == "<subj>":
+             current = 's'
+             if relation != '':
+                 triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
+             object_ = ''
+         elif token == "<obj>":
+             current = 'o'
+             relation = ''
+         else:
+             if current == 't':
+                 subject += ' ' + token
+             elif current == 's':
+                 object_ += ' ' + token
+             elif current == 'o':
+                 relation += ' ' + token
+     if subject != '' and relation != '' and object_ != '':
+         triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
+     return triplets
+
+ extracted_triplets = extract_triplets(extracted_text[0]["generated_text"])
+ print(extracted_triplets)
+ ```
+
+ A trick that may give better results is to constrain generation: first extract entities with a NER pipeline, then force those tokens to appear in the generated output.
+
+ ```python
+ from transformers import pipeline
+
+ # extract_triplets is defined in the snippet above
+ triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')
+ ner_extractor = pipeline("ner", model="Babelscape/wikineural-multilingual-ner", aggregation_strategy="simple")
+
+ input_text = "Nederland is een van de landen binnen het Koninkrijk der Nederlanden."
+
+ # Extract entities
+ ner_output = ner_extractor(input_text)
+ ents = [i["word"] for i in ner_output]
+
+ if len(ents) > 0:
+     # Force the detected entity tokens to appear in the generated output
+     tokens = triplet_extractor.tokenizer(ents, add_special_tokens=False)["input_ids"]
+     extracted_text = triplet_extractor(input_text, max_length=512, force_words_ids=tokens)
+ else:
+     extracted_text = triplet_extractor(input_text, max_length=512)
+
+ triplets = extract_triplets(extracted_text[0]["generated_text"])
+ ```
 
 ## Training and evaluation data
 
- More information needed
+ The data used for developing and evaluating this model was generated with https://github.com/Babelscape/crocodile .
 
 ## Training procedure
 
@@ -78,4 +145,4 @@ The following hyperparameters were used during training:
 - Transformers 4.27.2
 - Pytorch 1.13.1+cu117
 - Datasets 2.10.1
- - Tokenizers 0.12.1
+ - Tokenizers 0.12.1
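
To make the parsing step in the updated card concrete, here is a standalone run of its `extract_triplets` parser on a hand-written linearized string. The sample string and its triplets are illustrative, not real model output, and the duplicate `relation` in the card's tuple unpacking is dropped; otherwise the function follows the card's code.

```python
def extract_triplets(text):
    """Parse a REBEL-style linearized string (<triplet> head <subj> tail <obj> relation) into dicts."""
    triplets = []
    relation, subject, object_ = '', '', ''
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").strip().split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                # A previous triplet is complete; store it before starting the next head.
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
    return triplets

# Hypothetical linearized output encoding two triplets (hand-written sample):
sample = "<s><triplet> Nederland <subj> Europa <obj> ligt in <triplet> Amsterdam <subj> Nederland <obj> hoofdstad van</s>"
print(extract_triplets(sample))
# → [{'head': 'Nederland', 'type': 'ligt in', 'tail': 'Europa'},
#    {'head': 'Amsterdam', 'type': 'hoofdstad van', 'tail': 'Nederland'}]
```

Note that multi-word relations ("hoofdstad van") are accumulated token by token, which is why the parser joins tokens with spaces and strips at the end.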