	---
	datasets:
	- rebel-short
	metrics:
	- rouge
	model-index:
	- name: flan-t5-base
	  results:
	  - task:
	      name: Sequence-to-sequence Language Modeling
	      type: text2text-generation
	    dataset:
	      name: rebel-short
	      type: rebel-short
	      config: default
	      split: test
	      args: default
	    metrics:
	    - name: Rouge1
	      type: rouge
	      value: 51.5716
	license: cc-by-sa-4.0
	language:
	- nl
	pipeline_tag: text2text-generation
	---

	<!-- This model card has been generated automatically according to the information the Trainer had access to. You
	should probably proofread and complete it, then remove this comment. -->

	# flan-rebel-nl

	This model is a fine-tuned version of flan-t5-base on the rebel-short dataset.
	It achieves the following results on the evaluation set:
	- Loss: 0.1029
	- Rouge1: 51.5716
	- Rouge2: 40.2152
	- Rougel: 49.9941
	- Rougelsum: 49.9767
	- Gen Len: 18.5898

	## Model description

	This is a flan-t5-base model fine-tuned on a Dutch version of the dataset from REBEL: Relation Extraction By End-to-end Language generation. The model extracts triplets of the form {head, relation, tail} from unstructured text. The Dutch triplets and their source text were generated with the code of the original REBEL authors, available at https://github.com/Babelscape/crocodile.


	## Pipeline usage

	The code below is adapted from the original REBEL model: https://huggingface.co/Babelscape/rebel-large .

	```python
	from transformers import pipeline

	triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')
	# Generate the linearized triplets for an example Dutch passage.
	extracted_text = triplet_extractor("Nederland is een van de landen binnen het Koninkrijk der Nederlanden. Nederland ligt voor het overgrote deel in het noordwesten van Europa, aan de Noordzee. ", max_length=512, num_beams=3, temperature=1)

	# Function to parse the generated text and extract the triplets
	def extract_triplets(text):
	    triplets = []
	    relation, subject, object_ = '', '', ''
	    text = text.strip()
	    current = 'x'
	    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
	        if token == "<triplet>":
	            # A new head entity starts; flush the pending triplet first.
	            current = 't'
	            if relation != '':
	                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
	                relation = ''
	            subject = ''
	        elif token == "<subj>":
	            # A new tail entity starts for the current head.
	            current = 's'
	            if relation != '':
	                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
	            object_ = ''
	        elif token == "<obj>":
	            # The relation label follows.
	            current = 'o'
	            relation = ''
	        else:
	            if current == 't':
	                subject += ' ' + token
	            elif current == 's':
	                object_ += ' ' + token
	            elif current == 'o':
	                relation += ' ' + token
	    if subject != '' and relation != '' and object_ != '':
	        triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
	    return triplets

	# The pipeline returns a list of dicts; the text is under "generated_text".
	extracted_triplets = extract_triplets(extracted_text[0]["generated_text"])
	print(extracted_triplets)
	```
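The parser above implies a linearized output of the form `<triplet> head <subj> tail <obj> relation`, where one head may be followed by several `<subj> ... <obj> ...` pairs. The following is a minimal, self-contained sketch of that format using plain string splitting. The input string is hand-written for illustration, not real model output, and both the relation labels in it and the function name `parse_linearized` are made up for this example.

```python
def parse_linearized(text):
    """Parse '<triplet> head <subj> tail <obj> relation' sequences into dicts."""
    triplets = []
    # Each <triplet> marker opens a new head entity.
    for chunk in text.split("<triplet>")[1:]:
        head, _, rest = chunk.partition("<subj>")
        # One head can be followed by several '<subj> tail <obj> relation' pairs.
        for pair in rest.split("<subj>"):
            tail, sep, relation = pair.partition("<obj>")
            if sep and head.strip() and tail.strip() and relation.strip():
                triplets.append({
                    "head": head.strip(),
                    "type": relation.strip(),
                    "tail": tail.strip(),
                })
    return triplets


# Hand-written example string (hypothetical relation labels).
example = (
    "<triplet> Nederland <subj> Koninkrijk der Nederlanden <obj> land "
    "<subj> Europa <obj> continent"
)
for triplet in parse_linearized(example):
    print(triplet)
```

Unlike the token-by-token parser, this sketch drops incomplete pairs (for example a `<subj>` with no `<obj>`), which may or may not be what you want for truncated generations.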

	A trick that might give better results is to extract entities with an NER pipeline and force those tokens to appear in the generated output.

	```python

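	from transformers import pipeline

	triplet_extractor = pipeline('text2text-generation', model='Kbrek/flan_rebel_nl', tokenizer='Kbrek/flan_rebel_nl')
	ner_extractor = pipeline("ner", "Babelscape/wikineural-multilingual-ner", aggregation_strategy="simple")

	input_text = "Nederland is een van de landen binnen het Koninkrijk der Nederlanden."

	# Extract entity strings with the NER pipeline
	ner_output = ner_extractor(input_text)
	ents = [i["word"] for i in ner_output]

	if len(ents) > 0:
	    # Force the entity tokens to appear in the output. Forced words use
	    # constrained beam search, which requires num_beams > 1.
	    tokens = triplet_extractor.tokenizer(ents, add_special_tokens=False)["input_ids"]
	    extracted_text = triplet_extractor(input_text, max_length=512, num_beams=3, force_words_ids=tokens)
	else:
	    extracted_text = triplet_extractor(input_text, max_length=512, temperature=1)

	# extract_triplets is the parsing function defined in the previous example
	triplets = extract_triplets(extracted_text[0]["generated_text"])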


	```

	## Training and evaluation data

	The data used to develop and evaluate this model was generated with https://github.com/Babelscape/crocodile .

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 5e-05
	- train_batch_size: 4
	- eval_batch_size: 4
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 5
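To illustrate what `lr_scheduler_type: linear` means with these settings, the sketch below computes the learning rate at the end of each epoch. It assumes no warmup steps (the card lists none) and takes the total of 110235 optimizer steps from the training results table. The function name `linear_lr` is illustrative and not part of Transformers.

```python
BASE_LR = 5e-5
STEPS_PER_EPOCH = 22047            # step count after epoch 1 in the results table
TOTAL_STEPS = 5 * STEPS_PER_EPOCH  # 110235, the final step in the results table


def linear_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for epoch in range(1, 6):
    step = epoch * STEPS_PER_EPOCH
    print(f"epoch {epoch}: step {step}, lr = {linear_lr(step):.2e}")
```

Under these assumptions the rate falls by one fifth of its initial value per epoch and reaches zero at the final step.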

	### Training results

	| Training Loss | Epoch | Step   | Validation Loss | Rouge1  | Rouge2  | Rougel  | Rougelsum | Gen Len |
	|:-------------:|:-----:|:------:|:---------------:|:-------:|:-------:|:-------:|:---------:|:-------:|
	| 0.1256        | 1.0   | 22047  | 0.1206          | 50.3892 | 38.2761 | 48.7657 | 48.7444   | 18.6112 |
	| 0.1091        | 2.0   | 44094  | 0.1112          | 50.9615 | 39.2843 | 49.3865 | 49.3674   | 18.5447 |
	| 0.0875        | 3.0   | 66141  | 0.1047          | 51.2045 | 39.7598 | 49.6483 | 49.6317   | 18.5763 |
	| 0.0841        | 4.0   | 88188  | 0.1036          | 51.3543 | 39.9776 | 49.8528 | 49.8223   | 18.6178 |
	| 0.0806        | 5.0   | 110235 | 0.1029          | 51.5716 | 40.2152 | 49.9941 | 49.9767   | 18.5898 |
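As a rough guide to reading the Rouge1 column, the sketch below computes a simplified ROUGE-1 F1 score from unigram overlap between a prediction and a reference. It is an approximation for intuition only: it uses whitespace tokenization and no stemming, so it will not reproduce the scores above exactly, and the assumption that those scores are F-measures on a 0 to 100 scale is not confirmed by the card. The function name `rouge1_f1` and the example strings are illustrative.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap, whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection clips each token's count to the smaller side.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hand-written strings for illustration, not taken from the dataset.
prediction = "Nederland Europa continent"
reference = "Nederland Europa land"
print(f"ROUGE-1 F1 (0 to 100 scale): {100 * rouge1_f1(prediction, reference):.2f}")
```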


	### Framework versions

	- Transformers 4.27.2
	- Pytorch 1.13.1+cu117
	- Datasets 2.10.1
	- Tokenizers 0.12.1