import pandas as pd
from datasets import load_dataset
import streamlit as st

from clarin_datasets.dataset_to_show import DatasetToShow


class PunctuationRestorationDataset(DatasetToShow):
    def __init__(self):
        DatasetToShow.__init__(self)
        self.dataset_name = "clarin-pl/2021-punctuation-restoration"
        self.description = """
        Speech transcripts generated by Automatic Speech Recognition (ASR) systems typically do
        not contain any punctuation or capitalization. In longer stretches of automatically recognized speech,
        the lack of punctuation affects the general clarity of the output text [1]. The primary purpose of
        punctuation restoration (PR) and capitalization restoration (CR) as distinct natural language processing
        (NLP) tasks is to improve the legibility of ASR-generated text, and possibly other types of texts without
        punctuation. Aside from their intrinsic value, PR and CR may improve the performance of other NLP tasks
        such as Named Entity Recognition (NER), part-of-speech (POS) tagging, semantic parsing, or spoken dialog
        segmentation [2, 3]. Useful as it seems, it is hard to systematically evaluate PR on transcripts of
        conversational language, mainly because punctuation rules can be ambiguous even for originally written
        texts, and the very nature of naturally occurring spoken language makes it difficult to identify clear
        phrase and sentence boundaries [4, 5]. Given these requirements and limitations, a PR task based on a
        redistributable corpus of read speech was suggested. The 1,200 texts included in this collection
        (totaling over 240,000 words) were selected from two distinct sources: WikiNews and WikiTalks.
        Punctuation found in these sources should be approached with some reservation when used for evaluation:
        these are original texts and may contain some user-induced errors and bias. The texts were read out by
        over a hundred different speakers. Original texts with punctuation were force-aligned with the recordings
        and used as the ideal ASR output. The goal of the task is to provide a solution for restoring punctuation
        in the test set collated for this task. The test set consists of time-aligned ASR transcriptions of read
        texts from the two sources. Participants are encouraged to use both text-based and speech-derived
        features to identify punctuation symbols (e.g. a multimodal framework [6]). In addition, the train set is
        accompanied by reference text corpora of WikiNews and WikiTalks data that can be used in training and
        fine-tuning punctuation models.

        Task description
        The purpose of this task is to restore punctuation in ASR transcripts of texts read aloud.
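        """

    def load_data(self):
        # Fetch the dataset from the Hugging Face Hub and keep one pandas
        # DataFrame per subset. `self.subsets` is not set in this class; it is
        # presumably provided by DatasetToShow.
        raw_dataset = load_dataset(self.dataset_name)
        self.data_dict = {
            subset: raw_dataset[subset].to_pandas() for subset in self.subsets
        }

    def show_dataset(self):
        # Rendering is not implemented yet.
        pass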
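

# --- Illustrative sketch (not part of the original module) -------------------
# The description above frames punctuation restoration as recovering the marks
# that ASR output lacks. One common way to set this up, assumed here rather
# than taken from the dataset's documentation, is to pair each lowercased,
# punctuation-free token with the mark that followed it in the original text.
# The helper name, the mark set, and the "O" (no punctuation) label are all
# hypothetical; the dataset's real column layout may differ.
import re

_SKETCH_MARKS = ".,?!:;-"


def sketch_tokens_and_labels(text, marks=_SKETCH_MARKS):
    """Return (token, label) pairs; label is the first trailing mark or "O"."""
    pairs = []
    pattern = re.compile(r"^(.*?)([" + re.escape(marks) + r"]*)$")
    for raw in text.split():
        # Separate the word from any punctuation marks attached to its end.
        match = pattern.match(raw)
        word, trailing = match.group(1), match.group(2)
        if not word:
            continue  # token was punctuation only, e.g. a free-standing dash
        label = trailing[0] if trailing else "O"
        # Lowercasing mimics the missing capitalization of ASR output.
        pairs.append((word.lower(), label))
    return pairs


# Usage: sketch_tokens_and_labels("Hello, world.") pairs "hello" with ","
# and "world" with "."; a model would be trained to predict the labels from
# the bare tokens (and, optionally, speech-derived features).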