Missing spaces between chunks in long-form fine-tune outputs & the importance of tokenizer.json
Hi all, I followed the Whisper fine-tune event training guide and the models produced certainly have increased accuracy. However, when concatenating the <30s chunks to form the long-form text output of the whole transcript, the space between words at the beginning/end of adjacent chunks is sometimes missing.
E.g.:
Chunk 1: Bob went to the beach.
Chunk 2: It was a very sunny day.
Chunk 3: He put on sunscreen
Chunk 4: to protect his skin from sunburn.
Transcript Text: Bob went to the beach.It was a very sunny day. He put on sunscreento protect his skin from sunburn.
This does not occur 100% of the time; it is most common after a "."
It does not occur with the stock model using the same inference script (pipeline).
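For what it's worth, here is a minimal illustration (not the actual pipeline, and using openai/whisper-small only as a stand-in checkpoint) of how Whisper's byte-level BPE bakes the leading space into the token itself, so whether the space survives a naive chunk join depends on which variant of the first token the model predicts:

```python
# Illustration only: Whisper's byte-level BPE stores a leading space inside the token
# ("ĠIt" vs "It"). If the first predicted token of a chunk is the no-space variant,
# concatenating the decoded chunk texts glues the words together.
from transformers import WhisperTokenizer

tok = WhisperTokenizer.from_pretrained("openai/whisper-small")  # stand-in checkpoint

with_space = tok.decode(tok.encode(" It was a very sunny day.", add_special_tokens=False))
no_space = tok.decode(tok.encode("It was a very sunny day.", add_special_tokens=False))

print("Bob went to the beach." + with_space)  # Bob went to the beach. It was a very sunny day.
print("Bob went to the beach." + no_space)    # Bob went to the beach.It was a very sunny day.
```

I can't confirm this is what is happening with my fine-tuned model, but it would match the symptom.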
I notice the fine-tune script does not produce a tokenizer.json, as it uses the slow tokenizer, yet the stock models on Hugging Face do include this tokenizer.json?
In the stock model's tokenizer.json, there is an object:
"decoder": {
"type": "ByteLevel",
"add_prefix_space": true,
"trim_offsets": true,
"use_regex": true
I trialed the fine-tuned model with and without this tokenizer.json, however, and there was no difference in the outputs.
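In case it's useful for comparing against the stock models, a tokenizer.json can be generated by loading the fast tokenizer and saving it next to the fine-tuned weights. A rough sketch (the output path and the language/task values are placeholders for whatever the fine-tune actually used):

```python
# Rough sketch, not part of the training guide: export a fast tokenizer, which writes
# tokenizer.json alongside the fine-tuned weights. "./whisper-finetuned" and the
# language/task values below are placeholders.
from transformers import WhisperTokenizerFast

fast_tok = WhisperTokenizerFast.from_pretrained(
    "openai/whisper-small", language="english", task="transcribe"
)
fast_tok.save_pretrained("./whisper-finetuned")  # writes tokenizer.json among other files
```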
A last observation: in the fine-tune script, the trainer arguments are:
```python
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
```
Should that last line be `tokenizer=processor.tokenizer`? I have tested this change and the script still runs fine, so I'm assuming the Whisper processor just picks up the tokenizer regardless of whether the feature extractor or the tokenizer module is specified there.
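As I understand it (happy to be corrected), that argument is mainly used so the object gets saved alongside the checkpoint. A hedged sketch of the belt-and-braces step I could add after training, continuing the variables from the script above:

```python
# Continuation of the fine-tune script above: explicitly saving the full processor puts
# both the feature-extractor and tokenizer files (including tokenizer.json when a fast
# tokenizer is used) into the output directory, whatever was passed as `tokenizer=`.
trainer.train()
processor.save_pretrained(training_args.output_dir)
```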
Can anyone think of a cause for the missing spaces issue?
Thanks!