import pandas as pd
from datasets import load_dataset
import streamlit as st

from clarin_datasets.dataset_to_show import DatasetToShow


class PunctuationRestorationDataset(DatasetToShow):
    def __init__(self):
        super().__init__()
        self.dataset_name = "clarin-pl/2021-punctuation-restoration"
        self.description = """
        Speech transcripts generated by Automatic Speech Recognition (ASR) systems typically do 
        not contain any punctuation or capitalization. In longer stretches of automatically recognized speech, 
        the lack of punctuation reduces the overall clarity of the output text [1]. The primary purpose of
        punctuation restoration (PR) and capitalization restoration (CR) as distinct natural language processing
        (NLP) tasks is to improve the legibility of ASR-generated text, and possibly other types of texts without
        punctuation. Aside from their intrinsic value, PR and CR may improve the performance of other NLP tasks such
        as Named Entity Recognition (NER), part-of-speech (POS) tagging, semantic parsing, and spoken dialog
        segmentation [2, 3]. Useful as it seems, it is hard to systematically evaluate PR on transcripts of
        conversational language, mainly because punctuation rules can be ambiguous even for originally written
        texts, and the very nature of naturally-occurring spoken language makes it difficult to identify clear
        phrase and sentence boundaries [4, 5]. Given these requirements and limitations, a PR task based on a
        redistributable corpus of read speech was suggested. The 1200 texts included in this collection (totaling
        over 240,000 words) were selected from two distinct sources: WikiNews and WikiTalks. Punctuation found in
        these sources should be approached with some reservation when used for evaluation: these are original texts
        and may contain some user-induced errors and bias. The texts were read out by over a hundred different
        speakers. Original texts with punctuation were force-aligned with the recordings and used as the ideal ASR
        output. The goal of the task is to provide a solution for restoring punctuation in the test set collated
        for this task. The test set consists of time-aligned ASR transcriptions of read texts from the two sources.
        Participants are encouraged to use both text-based and speech-derived features to identify punctuation
        symbols (e.g. a multimodal framework [6]). In addition, the training set is accompanied by reference text
        corpora of WikiNews and WikiTalks data that can be used in training and fine-tuning punctuation models.

        Task description
        The purpose of this task is to restore punctuation in ASR transcriptions of texts read out loud.
        """

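    def load_data(self):
        # Download every split of the dataset from the Hugging Face Hub.
        raw_dataset = load_dataset(self.dataset_name)
        # Fall back to all available splits if no subsets were configured.
        subsets = getattr(self, "subsets", None) or list(raw_dataset.keys())
        # Convert each split to a pandas DataFrame, keyed by split name.
        self.data_dict = {
            subset: raw_dataset[subset].to_pandas() for subset in subsets
        }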

    def show_dataset(self):
        pass
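

# --- Illustrative sketch, not part of the original module --------------------
# The description above frames the task as restoring punctuation in ASR-style
# text. As a rough illustration only, and assuming that ASR output can be
# approximated by lowercasing a sentence and removing its punctuation marks,
# the hypothetical helper below turns a punctuated sentence into such an input.
# The helper name and the chosen set of punctuation marks are assumptions; they
# are not taken from the dataset or from the task definition.
import re

_PUNCTUATION_PATTERN = re.compile(r"[.,!?;:\-…]")


def simulate_asr_text(text: str) -> str:
    """Lowercase the text, replace punctuation with spaces, collapse whitespace."""
    without_punctuation = _PUNCTUATION_PATTERN.sub(" ", text)
    return " ".join(without_punctuation.lower().split())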