Mariusz Kossakowski committed on
Commit
9f7f573
1 Parent(s): 739b527

Add more datasets

Browse files
clarin_datasets/cst_wikinews_dataset.py ADDED
@@ -0,0 +1,20 @@
+ import pandas as pd
+ from datasets import load_dataset
+ import streamlit as st
+
+ from clarin_datasets.dataset_to_show import DatasetToShow
+
+
+ class CSTWikinewsDataset(DatasetToShow):
+     def __init__(self):
+         DatasetToShow.__init__(self)
+         self.dataset_name = "clarin-pl/cst-wikinews"
+         self.description = """
+
+         """
+
+     def load_data(self):
+         DatasetToShow.load_data(self)
+
+     def show_dataset(self):
+         pass
clarin_datasets/kpwr_ner_datasets.py ADDED
@@ -0,0 +1,62 @@
+ from datasets import load_dataset
+ import streamlit as st
+
+ from clarin_datasets.dataset_to_show import DatasetToShow
+
+
+ class KpwrNerDataset(DatasetToShow):
+     def __init__(self):
+         DatasetToShow.__init__(self)
+         self.dataset_name = "clarin-pl/kpwr-ner"
+         self.description = """
+         KPWR-NER is a part of the Polish Corpus of Wrocław University of Technology (Korpus Języka
+         Polskiego Politechniki Wrocławskiej). Its objective is named entity recognition for
+         fine-grained categories of entities. It is the 'n82' version of KPWr, meaning the number
+         of classes is restricted to 82 (originally 120). During corpus creation, texts from
+         various sources were annotated by humans, covering many domains and genres.
+
+         Tasks (input, output and metrics)
+         Named entity recognition (NER) - tagging entities in text with their corresponding type.
+
+         Input ('tokens' column): sequence of tokens
+
+         Output ('ner' column): sequence of predicted token classes in BIO notation (82 possible
+         classes, described in detail in the annotation guidelines)
+
+         example:
+
+         ['Roboty', 'mają', 'kilkanaście', 'lat', 'i', 'pochodzą', 'z', 'USA', ',', 'Wysokie',
+         'napięcie', 'jest', 'dużo', 'młodsze', ',', 'powstało', 'w', 'Niemczech', '.'] →
+         ['B-nam_pro_title', 'O', 'O', 'O', 'O', 'O', 'O', 'B-nam_loc_gpe_country', 'O',
+         'B-nam_pro_title', 'I-nam_pro_title', 'O', 'O', 'O', 'O', 'O', 'O',
+         'B-nam_loc_gpe_country', 'O']
+         """
+
+     def load_data(self):
+         raw_dataset = load_dataset(self.dataset_name)
+         self.data_dict = {
+             subset: raw_dataset[subset].to_pandas() for subset in self.subsets
+         }
+
+     def show_dataset(self):
+         header = st.container()
+         description = st.container()
+         dataframe_head = st.container()
+
+         with header:
+             st.title(self.dataset_name)
+
+         with description:
+             st.header("Dataset description")
+             st.write(self.description)
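The description above illustrates the 'ner' column with a BIO-tagged sentence. A minimal sketch of how such a tag sequence decodes into entity spans, using the example from the dataset card; `bio_to_spans` is an illustrative helper, not part of this repository:

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into (label, start, end) spans, end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:  # close the previous entity
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue  # entity continues
        else:  # "O" or a mismatched I- tag ends any open entity
            if start is not None:
                spans.append((label, start, i))
            start, label = None, None
    if start is not None:
        spans.append((label, start, len(tags)))
    return spans


# The example sequence from the dataset description:
tags = ["B-nam_pro_title", "O", "O", "O", "O", "O", "O", "B-nam_loc_gpe_country",
        "O", "B-nam_pro_title", "I-nam_pro_title", "O", "O", "O", "O", "O", "O",
        "B-nam_loc_gpe_country", "O"]
print(bio_to_spans(tags))
# → [('nam_pro_title', 0, 1), ('nam_loc_gpe_country', 7, 8),
#    ('nam_pro_title', 9, 11), ('nam_loc_gpe_country', 17, 18)]
```

So 'Wysokie napięcie' (indices 9-10) is one `nam_pro_title` entity, while 'USA' and 'Niemczech' are single-token `nam_loc_gpe_country` entities.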
clarin_datasets/nkjp_pos_dataset.py ADDED
@@ -0,0 +1,20 @@
+ import pandas as pd
+ from datasets import load_dataset
+ import streamlit as st
+
+ from clarin_datasets.dataset_to_show import DatasetToShow
+
+
+ class NkjpPosDataset(DatasetToShow):
+     def __init__(self):
+         DatasetToShow.__init__(self)
+         self.dataset_name = "clarin-pl/nkjp-pos"
+         self.description = """
+
+         """
+
+     def load_data(self):
+         DatasetToShow.load_data(self)
+
+     def show_dataset(self):
+         pass
clarin_datasets/punctuation_restoration_dataset.py ADDED
@@ -0,0 +1,43 @@
+ import pandas as pd
+ from datasets import load_dataset
+ import streamlit as st
+
+ from clarin_datasets.dataset_to_show import DatasetToShow
+
+
+ class PunctuationRestorationDataset(DatasetToShow):
+     def __init__(self):
+         DatasetToShow.__init__(self)
+         self.dataset_name = "clarin-pl/2021-punctuation-restoration"
+         self.description = """
+         Speech transcripts generated by Automatic Speech Recognition (ASR) systems typically do
+         not contain any punctuation or capitalization. In longer stretches of automatically
+         recognized speech, the lack of punctuation affects the general clarity of the output
+         text [1]. The primary purpose of punctuation restoration (PR) and capitalization
+         restoration (CR) as a distinct natural language processing (NLP) task is to improve the
+         legibility of ASR-generated text, and possibly other types of texts without punctuation.
+         Aside from their intrinsic value, PR and CR may improve the performance of other NLP
+         tasks such as Named Entity Recognition (NER), part-of-speech (POS) tagging, semantic
+         parsing, and spoken dialog segmentation [2, 3]. As useful as it seems, it is hard to
+         systematically evaluate PR on transcripts of conversational language, mainly because
+         punctuation rules can be ambiguous even for originally written texts, and the very
+         nature of naturally-occurring spoken language makes it difficult to identify clear
+         phrase and sentence boundaries [4, 5]. Given these requirements and limitations, a PR
+         task based on a redistributable corpus of read speech was suggested. The 1200 texts
+         included in this collection (totaling over 240,000 words) were selected from two
+         distinct sources: WikiNews and WikiTalks. Punctuation found in these sources should be
+         approached with some reservation when used for evaluation: these are original texts and
+         may contain some user-induced errors and bias. The texts were read out by over a hundred
+         different speakers. Original texts with punctuation were force-aligned with recordings
+         and used as the ideal ASR output. The goal of the task is to provide a solution for
+         restoring punctuation in the test set collated for this task. The test set consists of
+         time-aligned ASR transcriptions of read texts from the two sources. Participants are
+         encouraged to use both text-based and speech-derived features to identify punctuation
+         symbols (e.g. a multimodal framework [6]). In addition, the train set is accompanied by
+         reference text corpora of WikiNews and WikiTalks data that can be used in training and
+         fine-tuning punctuation models.
+
+         Task description
+         The purpose of this task is to restore punctuation in the ASR recognition of texts read
+         out loud.
+         """
+
+     def load_data(self):
+         DatasetToShow.load_data(self)
+
+     def show_dataset(self):
+         pass
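The description frames PR as predicting punctuation symbols for an unpunctuated ASR transcript against a punctuated reference. A minimal sketch of how such output might be scored, assuming identical word sequences (real evaluation on time-aligned ASR output is more involved); `punct_after_words` and `punct_f1` are illustrative helpers, not part of the task's official scoring:

```python
PUNCT = {",", ".", "?", "!", ":", ";"}


def punct_after_words(tokens):
    """Return, for each word, the punctuation symbol that follows it ('' if none)."""
    out = []
    for tok in tokens:
        if tok in PUNCT:
            if out:
                out[-1] = tok  # attach punctuation to the preceding word
        else:
            out.append("")
    return out


def punct_f1(reference, hypothesis):
    """F1 over punctuation decisions, word positions aligned one-to-one."""
    ref = punct_after_words(reference)
    hyp = punct_after_words(hypothesis)
    tp = sum(1 for r, h in zip(ref, hyp) if r and r == h)
    pred = sum(1 for h in hyp if h)
    gold = sum(1 for r in ref if r)
    precision = tp / pred if pred else 0.0
    recall = tp / gold if gold else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# Hypothetical reference vs. system output: the system misses the comma.
ref = ["Ala", "ma", "kota", ",", "a", "kot", "ma", "Alę", "."]
hyp = ["Ala", "ma", "kota", "a", "kot", "ma", "Alę", "."]
print(round(punct_f1(ref, hyp), 3))  # → 0.667 (precision 1.0, recall 0.5)
```

Participants would of course substitute the task's actual metric; this only illustrates why each punctuation decision can be treated as a per-word classification.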
clarin_datasets/punctuation_restoration_task.png ADDED