Upload 4 files

Files changed (4)

README.md +356 -1
arabic_billion_words.py +173 -0
dataset_infos.json +1 -0
gitattributes.txt +27 -0

README.md CHANGED

@@ -1,3 +1,358 @@
 ---
-license: openrail
 ---

 ---
+annotations_creators:
+- found
+language_creators:
+- found
+language:
+- ar
+license:
+- unknown
+multilinguality:
+- monolingual
+size_categories:
+- 100K<n<1M
+- 10K<n<100K
+- 1M<n<10M
+source_datasets:
+- original
+task_categories:
+- text-generation
+- fill-mask
+task_ids:
+- language-modeling
+- masked-language-modeling
+paperswithcode_id: null
+pretty_name: Arabic Billion Words
+configs:
+- Alittihad
+- Almasryalyoum
+- Almustaqbal
+- Alqabas
+- Echoroukonline
+- Ryiadh
+- Sabanews
+- SaudiYoum
+- Techreen
+- Youm7
+dataset_info:
+- config_name: Alittihad
+  features:
+  - name: url
+    dtype: string
+  - name: head_line
+    dtype: string
+  - name: date
+    dtype: string
+  - name: text
+    dtype: string
+  splits:
+  - name: train
+    num_bytes: 1601790302
+    num_examples: 349342
+  download_size: 348259999
+  dataset_size: 1601790302
+- config_name: Almasryalyoum
+  features:
+  - name: url
+    dtype: string
+  - name: head_line
+    dtype: string
+  - name: date
+    dtype: string
+  - name: text
+    dtype: string
+  splits:
+  - name: train
+    num_bytes: 1056197870
+    num_examples: 291723
+  download_size: 242604438
+  dataset_size: 1056197870
+- config_name: Almustaqbal
+  features:
+  - name: url
+    dtype: string
+  - name: head_line
+    dtype: string
+  - name: date
+    dtype: string
+  - name: text
+    dtype: string
+  splits:
+  - name: train
+    num_bytes: 1545659336
+    num_examples: 446873
+  download_size: 350826797
+  dataset_size: 1545659336
+- config_name: Alqabas
+  features:
+  - name: url
+    dtype: string
+  - name: head_line
+    dtype: string
+  - name: date
+    dtype: string
+  - name: text
+    dtype: string
+  splits:
+  - name: train
+    num_bytes: 2631729746
+    num_examples: 817274
+  download_size: 595274646
+  dataset_size: 2631729746
+- config_name: Echoroukonline
+  features:
+  - name: url
+    dtype: string
+  - name: head_line
+    dtype: string
+  - name: date
+    dtype: string
+  - name: text
+    dtype: string
+  splits:
+  - name: train
+    num_bytes: 464386206
+    num_examples: 139732
+  download_size: 108184378
+  dataset_size: 464386206
+- config_name: Ryiadh
+  features:
+  - name: url
+    dtype: string
+  - name: head_line
+    dtype: string
+  - name: date
+    dtype: string
+  - name: text
+    dtype: string
+  splits:
+  - name: train
+    num_bytes: 3101294859
+    num_examples: 858188
+  download_size: 691264971
+  dataset_size: 3101294859
+- config_name: Sabanews
+  features:
+  - name: url
+    dtype: string
+  - name: head_line
+    dtype: string
+  - name: date
+    dtype: string
+  - name: text
+    dtype: string
+  splits:
+  - name: train
+    num_bytes: 198019614
+    num_examples: 92149
+  download_size: 38214558
+  dataset_size: 198019614
+- config_name: SaudiYoum
+  features:
+  - name: url
+    dtype: string
+  - name: head_line
+    dtype: string
+  - name: date
+    dtype: string
+  - name: text
+    dtype: string
+  splits:
+  - name: train
+    num_bytes: 2723291416
+    num_examples: 888068
+  download_size: 605537923
+  dataset_size: 2723291416
+- config_name: Techreen
+  features:
+  - name: url
+    dtype: string
+  - name: head_line
+    dtype: string
+  - name: date
+    dtype: string
+  - name: text
+    dtype: string
+  splits:
+  - name: train
+    num_bytes: 1103458209
+    num_examples: 314597
+  download_size: 252976781
+  dataset_size: 1103458209
+- config_name: Youm7
+  features:
+  - name: url
+    dtype: string
+  - name: head_line
+    dtype: string
+  - name: date
+    dtype: string
+  - name: text
+    dtype: string
+  splits:
+  - name: train
+    num_bytes: 3004689464
+    num_examples: 1172136
+  download_size: 617708074
+  dataset_size: 3004689464
 ---
+# Dataset Card for Arabic Billion Words Corpus
+## Table of Contents
+- [Dataset Description](#dataset-description)
+  - [Dataset Summary](#dataset-summary)
+  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
+  - [Languages](#languages)
+- [Dataset Structure](#dataset-structure)
+  - [Data Instances](#data-instances)
+  - [Data Fields](#data-fields)
+  - [Data Splits](#data-splits)
+- [Dataset Creation](#dataset-creation)
+  - [Curation Rationale](#curation-rationale)
+  - [Source Data](#source-data)
+  - [Annotations](#annotations)
+  - [Personal and Sensitive Information](#personal-and-sensitive-information)
+- [Considerations for Using the Data](#considerations-for-using-the-data)
+  - [Social Impact of Dataset](#social-impact-of-dataset)
+  - [Discussion of Biases](#discussion-of-biases)
+  - [Other Known Limitations](#other-known-limitations)
+- [Additional Information](#additional-information)
+  - [Dataset Curators](#dataset-curators)
+  - [Licensing Information](#licensing-information)
+  - [Citation Information](#citation-information)
+  - [Contributions](#contributions)
+## Dataset Description
+- **Homepage:** http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus
+- **Repository:**
+- **Paper:** https://arxiv.org/pdf/1611.04033
+- **Leaderboard:**
+- **Point of Contact:** [Ibrahim Abu El-Khair](mailto:iabuelkhair@gmail.com)
+### Dataset Summary
+The Abu El-Khair Corpus is an Arabic text corpus that includes more than five million newspaper articles.
+It contains over a billion and a half words in total, of which about three million are unique.
+The corpus is provided in two encodings: UTF-8 and Windows CP-1256.
+It is also marked up in two markup languages: SGML and XML.
+### Supported Tasks and Leaderboards
+[More Information Needed]
+### Languages
+Arabic
+## Dataset Structure
+### Data Instances
+This is an example of the "Almasryalyoum" configuration subset:
+```python
+{
+  "url": "http://today.almasryalyoum.com/printerfriendly.aspx?ArticleID=61300",
+  "head_line": "رئيس وزراء المجر: عنصرية جماهير أوجبيست جلبت العار للبلاد",
+  "date": "19/5/2007",
+  "text": """قال متحدث باسم الحكومة المجرية: إن رئيس الوزراء فيرنك جيوركساني رحب بقرار اتحاد كرة القدم المجري بخصم ثلاث نقاط من نادي أوجبيست بسبب السلوك العنصري الذي صدر من جماهيره.
+وعاقب الاتحاد المجري فريق أوجبيست بعد أن سخرت جماهيره من إبراهيم سيديبي مهاجم فريق ديبرينسين الأسود أثناء مباراة الفريقين أوائل مايو الجاري.
+يذكر أن الاتحاد فرض أيضا غرامة مالية قدرها 20 ألف دولار علي أوجبيست في عام 2005 بعد أن رددت جماهيره شعارات معادية للسامية خلال مباراة بالدوري المجري.
+وأوضح جيوركساني في خطاب إلي إيستفان كيستليكي رئيس الاتحاد المجري لكرة القدم، أن هذا السلوك العنصري من الجماهير «جلب العار لكرة القدم وللمجر». يذكر أن المجر بها مجموعة من مشجعي كرة القدم المشاغبين «الهوليجانز»، وشارك الكثير منهم في أعمال شغب معادية للحكومة في العام الماضي.""",
+}
+```
+### Data Fields
+The data fields are:
+- "url": string, original URL of the article
+- "head_line": string, headline of the article
+- "date": string, date of the article (for example, "19/5/2007")
+- "text": string, text content of the article
+### Data Splits
+Each configuration subset has a single "train" split, containing the following number of examples:
+|                | Number of examples |
+|:---------------|-------------------:|
+| Alittihad      |             349342 |
+| Almasryalyoum  |             291723 |
+| Almustaqbal    |             446873 |
+| Alqabas        |             817274 |
+| Echoroukonline |             139732 |
+| Ryiadh         |             858188 |
+| Sabanews       |              92149 |
+| SaudiYoum      |             888068 |
+| Techreen       |             314597 |
+| Youm7          |            1172136 |
+## Dataset Creation
+### Curation Rationale
+[More Information Needed]
+### Source Data
+#### Initial Data Collection and Normalization
+[More Information Needed]
+#### Who are the source language producers?
+[More Information Needed]
+### Annotations
+#### Annotation process
+[More Information Needed]
+#### Who are the annotators?
+[More Information Needed]
+### Personal and Sensitive Information
+[More Information Needed]
+## Considerations for Using the Data
+### Social Impact of Dataset
+[More Information Needed]
+### Discussion of Biases
+[More Information Needed]
+### Other Known Limitations
+[More Information Needed]
+## Additional Information
+### Dataset Curators
+[More Information Needed]
+### Licensing Information
+[More Information Needed]
+### Citation Information
+```
+@article{el20161,
+  title={1.5 billion words arabic corpus},
+  author={El-Khair, Ibrahim Abu},
+  journal={arXiv preprint arXiv:1611.04033},
+  year={2016}
+}
+```
+### Contributions
+Thanks to [@zaidalyafeai](https://github.com/zaidalyafeai) and [@albertvillanova](https://github.com/albertvillanova) for adding this dataset.
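
As a quick sanity check on the Data Splits table in the README above, the per-configuration counts can be summed and compared with the summary's claim of "more than five million newspaper articles". This is only an illustrative sketch: the name `NUM_EXAMPLES` is hypothetical, and the figures are copied by hand from the table rather than read from the dataset itself.

```python
# Example counts copied from the "Data Splits" table (one train split per config).
# NUM_EXAMPLES is a hypothetical name used only for this illustration.
NUM_EXAMPLES = {
    "Alittihad": 349342,
    "Almasryalyoum": 291723,
    "Almustaqbal": 446873,
    "Alqabas": 817274,
    "Echoroukonline": 139732,
    "Ryiadh": 858188,
    "Sabanews": 92149,
    "SaudiYoum": 888068,
    "Techreen": 314597,
    "Youm7": 1172136,
}

# Total number of articles across all ten configurations.
total = sum(NUM_EXAMPLES.values())

# Configuration with the most articles.
largest = max(NUM_EXAMPLES, key=NUM_EXAMPLES.get)

print(f"total articles: {total:,}")
print(f"largest config: {largest}")
```

The total comes to a little under 5.4 million articles, which is consistent with the dataset summary.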

arabic_billion_words.py ADDED

	@@ -0,0 +1,173 @@

+# coding=utf-8
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Arabic Billion Words Corpus"""
+import os
+import re
+import datasets
+_CITATION = """\
+@article{el20161,
+  title={1.5 billion words arabic corpus},
+  author={El-Khair, Ibrahim Abu},
+  journal={arXiv preprint arXiv:1611.04033},
+  year={2016}
+}
+"""
+_DESCRIPTION = """\
+Abu El-Khair Corpus is an Arabic text corpus, that includes more than five million newspaper articles.
+It contains over a billion and a half words in total, out of which, there are about three million unique words.
+The corpus is encoded with two types of encoding, namely: UTF-8, and Windows CP-1256.
+Also it was marked with two mark-up languages, namely: SGML, and XML.
+"""
+_HOMEPAGE = "http://abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus"
+_URL = "http://abuelkhair.net/corpus/"
+_URLs = {
+    "Alittihad": _URL + "Alittihad_XML_utf_8.rar",
+    "Almasryalyoum": _URL + "Almasryalyoum_XML_utf_8.rar",
+    "Almustaqbal": _URL + "Almustaqbal_XML_utf_8.rar",
+    "Alqabas": _URL + "Alqabas_XML_utf_8.rar",
+    "Echoroukonline": _URL + "Echoroukonline_XML_utf_8.rar",
+    "Ryiadh": _URL + "Ryiadh_XML_utf_8.rar",
+    "Sabanews": _URL + "Sabanews_XML_utf_8.rar",
+    "SaudiYoum": _URL + "SaudiYoum_XML_utf_8.rar",
+    "Techreen": _URL + "Techreen_XML_utf_8.rar",
+    "Youm7": _URL + "Youm7_XML_utf_8.rar",
+}
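+# Some tags in the source XML are misspelled or inconsistent:
+# - Article tags carry a numeric suffix in some files
+#   (covered by the prefix match in `_generate_examples`):
+#   - Alqabas: <Alqabas>, <Alqabas1>
+#   - Ryiadh: <Ryiadh>, <Ryiadh1>
+# - Field tags have the spelling variants listed below.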
+MISSPELLED_TAGS = {
+    "Dateline": ["Dateline", "dateline"],
+    "Headline": ["Headline", "Healine"],
+    "Text": ["Text"],
+    "URL": ["URL"],
+}
+TAG_PATTERNS = {
+    tag: [re.compile(rf".*?<{label}>(.*?)</{label}>.*?", re.MULTILINE | re.DOTALL) for label in labels]
+    for tag, labels in MISSPELLED_TAGS.items()
+}
+class ArabicBillionWords(datasets.GeneratorBasedBuilder):
+    """Arabic Billion Words Corpus"""
+    VERSION = datasets.Version("1.1.0")
+    BUILDER_CONFIGS = [
+        datasets.BuilderConfig(
+            name="Alittihad", version=VERSION, description="This part of dataset covers Alittihad newspaper"
+        ),
+        datasets.BuilderConfig(
+            name="Almasryalyoum", version=VERSION, description="This part of dataset covers Almasryalyoum newspaper"
+        ),
+        datasets.BuilderConfig(
+            name="Almustaqbal", version=VERSION, description="This part of dataset covers Almustaqbal newspaper"
+        ),
+        datasets.BuilderConfig(
+            name="Alqabas", version=VERSION, description="This part of dataset covers Alqabas newspaper"
+        ),
+        datasets.BuilderConfig(
+            name="Echoroukonline", version=VERSION, description="This part of dataset covers Echoroukonline newspaper"
+        ),
+        datasets.BuilderConfig(
+            name="Ryiadh", version=VERSION, description="This part of dataset covers Ryiadh newspaper"
+        ),
+        datasets.BuilderConfig(
+            name="Sabanews", version=VERSION, description="This part of dataset covers Sabanews newspaper"
+        ),
+        datasets.BuilderConfig(
+            name="SaudiYoum", version=VERSION, description="This part of dataset covers SaudiYoum newspaper"
+        ),
+        datasets.BuilderConfig(
+            name="Techreen", version=VERSION, description="This part of dataset covers Techreen newspaper"
+        ),
+        datasets.BuilderConfig(
+            name="Youm7", version=VERSION, description="This part of dataset covers Youm7 newspaper"
+        ),
+    ]
+    def _info(self):
+        features = datasets.Features(
+            {
+                "url": datasets.Value("string"),
+                "head_line": datasets.Value("string"),
+                "date": datasets.Value("string"),
+                "text": datasets.Value("string"),
+            }
+        )
+        return datasets.DatasetInfo(
+            description=_DESCRIPTION,
+            features=features,
+            homepage=_HOMEPAGE,
+            citation=_CITATION,
+        )
+    def _split_generators(self, dl_manager):
+        """Returns SplitGenerators."""
+        my_urls = _URLs[self.config.name]
+        data_dir = dl_manager.download_and_extract(my_urls)
+        my_file_name = f"{self.config.name}_utf_8.xml"
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.TRAIN,
+                gen_kwargs={
+                    "filepath": os.path.join(data_dir, my_file_name),
+                },
+            ),
+        ]
+    def _generate_examples(self, filepath):
+        """Yields examples."""
+        data_tag = self.config.name
+        # Match on the tag prefix so that suffixed variants such as <Alqabas1> are also captured.
+        pattern = re.compile(rf".*?<{data_tag}(.*?)</{data_tag}.*?", re.MULTILINE | re.DOTALL)
+        key = 0
+        lines = ""
+        with open(filepath, mode="r", encoding="utf-8") as f:
+            for line in f:
+                # Buffer lines until the closing article tag, then parse the buffered record.
+                lines += line
+                if f"</{data_tag}" in line:
+                    match = pattern.match(lines)
+                    lines = ""
+                    if match:
+                        record = match.group(1)
+                        text = self._clean_text(self._extract_tag("Text", record))
+                        url = self._extract_tag("URL", record)
+                        head_line = self._clean_text(self._extract_tag("Headline", record))
+                        date = self._extract_tag("Dateline", record)
+                        yield key, {"url": url, "head_line": head_line, "date": date, "text": text}
+                        key += 1
+    @staticmethod
+    def _extract_tag(tag, text):
+        # check if the tag is misspelled
+        for pattern in TAG_PATTERNS[tag]:
+            match = pattern.match(text)
+            if match:
+                return match.group(1)
+        return ""
+    @staticmethod
+    def _clean_text(text):
+        # Strip every "?" character from the extracted text.
+        return text.replace("?", "")
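
To make the tag handling in the script above easier to follow, here is a minimal standalone sketch that applies the same regular expressions to an in-memory record. The `SAMPLE` string is invented for illustration and is not taken from the corpus; it deliberately uses the suffixed article tag and the misspelled field tags that the script lists. `extract_tag` is a module-level copy of the script's `_extract_tag`, written this way only so the snippet runs without the `datasets` library.

```python
import re

# Same spelling variants as MISSPELLED_TAGS in the script above.
MISSPELLED_TAGS = {
    "Dateline": ["Dateline", "dateline"],
    "Headline": ["Headline", "Healine"],
    "Text": ["Text"],
    "URL": ["URL"],
}
TAG_PATTERNS = {
    tag: [re.compile(rf".*?<{label}>(.*?)</{label}>.*?", re.MULTILINE | re.DOTALL) for label in labels]
    for tag, labels in MISSPELLED_TAGS.items()
}


def extract_tag(tag, text):
    """Return the content of the first spelling variant of `tag` found in `text`, else ""."""
    for pattern in TAG_PATTERNS[tag]:
        match = pattern.match(text)
        if match:
            return match.group(1)
    return ""


# Hypothetical record, invented for this sketch; it is not taken from the corpus.
SAMPLE = """<Alqabas1>
<URL>http://example.com/article/1</URL>
<Healine>Sample headline</Healine>
<dateline>19/5/2007</dateline>
<Text>Sample body text?</Text>
</Alqabas1>
"""

# The article pattern matches on the tag prefix, so <Alqabas1> is accepted too.
data_tag = "Alqabas"
record_pattern = re.compile(rf".*?<{data_tag}(.*?)</{data_tag}.*?", re.MULTILINE | re.DOTALL)
record = record_pattern.match(SAMPLE).group(1)

# Headline and text go through the same "?" removal as _clean_text.
example = {
    "url": extract_tag("URL", record),
    "head_line": extract_tag("Headline", record).replace("?", ""),
    "date": extract_tag("Dateline", record),
    "text": extract_tag("Text", record).replace("?", ""),
}
print(example)
```

Each field falls back through its spelling variants in order, and a field whose tag is missing entirely yields an empty string rather than raising an error.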

dataset_infos.json ADDED

	@@ -0,0 +1 @@

+ {"Alittihad": {"description": "Abu El-Khair Corpus is an Arabic text corpus, that includes more than five million newspaper articles.\nIt contains over a billion and a half words in total, out of which, there are about three million unique words.\nThe corpus is encoded with two types of encoding, namely: UTF-8, and Windows CP-1256.\nAlso it was marked with two mark-up languages, namely: SGML, and XML.\n", "citation": "@article{el20161,\n title={1.5 billion words arabic corpus},\n author={El-Khair, Ibrahim Abu},\n journal={arXiv preprint arXiv:1611.04033},\n year={2016}\n}\n", "homepage": "http://abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus", "license": "", "features": {"url": {"dtype": "string", "id": null, "_type": "Value"}, "head_line": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"dtype": "string", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "arabic_billion_words", "config_name": "Alittihad", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 1601790302, "num_examples": 349342, "dataset_name": "arabic_billion_words"}}, "download_checksums": {"http://abuelkhair.net/corpus/Alittihad_XML_utf_8.rar": {"num_bytes": 348259999, "checksum": "6dd90f7ca98699e924e0ea423dc9f4f648c645379f8bffe15eeb97af00fd6fc0"}}, "download_size": 348259999, "post_processing_size": null, "dataset_size": 1601790302, "size_in_bytes": 1950050301}, "Almasryalyoum": {"description": "Abu El-Khair Corpus is an Arabic text corpus, that includes more than five million newspaper articles.\nIt contains over a billion and a half words in total, out of which, there are about three million unique words.\nThe corpus is encoded with two types of encoding, namely: UTF-8, and Windows CP-1256.\nAlso it was marked with two mark-up languages, namely: SGML, and 
XML.\n", "citation": "@article{el20161,\n title={1.5 billion words arabic corpus},\n author={El-Khair, Ibrahim Abu},\n journal={arXiv preprint arXiv:1611.04033},\n year={2016}\n}\n", "homepage": "http://abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus", "license": "", "features": {"url": {"dtype": "string", "id": null, "_type": "Value"}, "head_line": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"dtype": "string", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "arabic_billion_words", "config_name": "Almasryalyoum", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 1056197870, "num_examples": 291723, "dataset_name": "arabic_billion_words"}}, "download_checksums": {"http://abuelkhair.net/corpus/Almasryalyoum_XML_utf_8.rar": {"num_bytes": 242604438, "checksum": "f88d24179fa97df8d179242cb564301be2c7a4ecd36a027815b8ce1563059e7a"}}, "download_size": 242604438, "post_processing_size": null, "dataset_size": 1056197870, "size_in_bytes": 1298802308}, "Almustaqbal": {"description": "Abu El-Khair Corpus is an Arabic text corpus, that includes more than five million newspaper articles.\nIt contains over a billion and a half words in total, out of which, there are about three million unique words.\nThe corpus is encoded with two types of encoding, namely: UTF-8, and Windows CP-1256.\nAlso it was marked with two mark-up languages, namely: SGML, and XML.\n", "citation": "@article{el20161,\n title={1.5 billion words arabic corpus},\n author={El-Khair, Ibrahim Abu},\n journal={arXiv preprint arXiv:1611.04033},\n year={2016}\n}\n", "homepage": "http://abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus", "license": "", "features": {"url": {"dtype": "string", "id": null, "_type": "Value"}, "head_line": {"dtype": "string", "id": null, 
"_type": "Value"}, "date": {"dtype": "string", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "arabic_billion_words", "config_name": "Almustaqbal", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 1545659336, "num_examples": 446873, "dataset_name": "arabic_billion_words"}}, "download_checksums": {"http://abuelkhair.net/corpus/Almustaqbal_XML_utf_8.rar": {"num_bytes": 350826797, "checksum": "dff3361ad821f3bd3912cd7282db5c15a34919312b9bc7d708a8b30782c7fc36"}}, "download_size": 350826797, "post_processing_size": null, "dataset_size": 1545659336, "size_in_bytes": 1896486133}, "Alqabas": {"description": "Abu El-Khair Corpus is an Arabic text corpus, that includes more than five million newspaper articles.\nIt contains over a billion and a half words in total, out of which, there are about three million unique words.\nThe corpus is encoded with two types of encoding, namely: UTF-8, and Windows CP-1256.\nAlso it was marked with two mark-up languages, namely: SGML, and XML.\n", "citation": "@article{el20161,\n title={1.5 billion words arabic corpus},\n author={El-Khair, Ibrahim Abu},\n journal={arXiv preprint arXiv:1611.04033},\n year={2016}\n}\n", "homepage": "http://abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus", "license": "", "features": {"url": {"dtype": "string", "id": null, "_type": "Value"}, "head_line": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"dtype": "string", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "arabic_billion_words", "config_name": "Alqabas", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": 
"train", "num_bytes": 2631729746, "num_examples": 817274, "dataset_name": "arabic_billion_words"}}, "download_checksums": {"http://abuelkhair.net/corpus/Alqabas_XML_utf_8.rar": {"num_bytes": 595274646, "checksum": "e5ea70add534220a8caf8d230959f134f49a822ce3612adb4f1bb537dc3cc6b4"}}, "download_size": 595274646, "post_processing_size": null, "dataset_size": 2631729746, "size_in_bytes": 3227004392}, "Echoroukonline": {"description": "Abu El-Khair Corpus is an Arabic text corpus, that includes more than five million newspaper articles.\nIt contains over a billion and a half words in total, out of which, there are about three million unique words.\nThe corpus is encoded with two types of encoding, namely: UTF-8, and Windows CP-1256.\nAlso it was marked with two mark-up languages, namely: SGML, and XML.\n", "citation": "@article{el20161,\n title={1.5 billion words arabic corpus},\n author={El-Khair, Ibrahim Abu},\n journal={arXiv preprint arXiv:1611.04033},\n year={2016}\n}\n", "homepage": "http://abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus", "license": "", "features": {"url": {"dtype": "string", "id": null, "_type": "Value"}, "head_line": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"dtype": "string", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "arabic_billion_words", "config_name": "Echoroukonline", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 464386206, "num_examples": 139732, "dataset_name": "arabic_billion_words"}}, "download_checksums": {"http://abuelkhair.net/corpus/Echoroukonline_XML_utf_8.rar": {"num_bytes": 108184378, "checksum": "8f3e85bd99caeb9c5c4922edcd18720fc3700fd6751febfa7ee72e05a584a270"}}, "download_size": 108184378, "post_processing_size": null, "dataset_size": 464386206, "size_in_bytes": 
572570584}, "Ryiadh": {"description": "Abu El-Khair Corpus is an Arabic text corpus, that includes more than five million newspaper articles.\nIt contains over a billion and a half words in total, out of which, there are about three million unique words.\nThe corpus is encoded with two types of encoding, namely: UTF-8, and Windows CP-1256.\nAlso it was marked with two mark-up languages, namely: SGML, and XML.\n", "citation": "@article{el20161,\n title={1.5 billion words arabic corpus},\n author={El-Khair, Ibrahim Abu},\n journal={arXiv preprint arXiv:1611.04033},\n year={2016}\n}\n", "homepage": "http://abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus", "license": "", "features": {"url": {"dtype": "string", "id": null, "_type": "Value"}, "head_line": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"dtype": "string", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "arabic_billion_words", "config_name": "Ryiadh", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 3101294859, "num_examples": 858188, "dataset_name": "arabic_billion_words"}}, "download_checksums": {"http://abuelkhair.net/corpus/Ryiadh_XML_utf_8.rar": {"num_bytes": 691264971, "checksum": "c934867e53cb57d45ff99a8b5cfa991ae255a1ecb20e79309a41af2aa3e45c15"}}, "download_size": 691264971, "post_processing_size": null, "dataset_size": 3101294859, "size_in_bytes": 3792559830}, "Sabanews": {"description": "Abu El-Khair Corpus is an Arabic text corpus, that includes more than five million newspaper articles.\nIt contains over a billion and a half words in total, out of which, there are about three million unique words.\nThe corpus is encoded with two types of encoding, namely: UTF-8, and Windows CP-1256.\nAlso it was marked with two mark-up languages, namely: SGML, and 
XML.\n", "citation": "@article{el20161,\n title={1.5 billion words arabic corpus},\n author={El-Khair, Ibrahim Abu},\n journal={arXiv preprint arXiv:1611.04033},\n year={2016}\n}\n", "homepage": "http://abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus", "license": "", "features": {"url": {"dtype": "string", "id": null, "_type": "Value"}, "head_line": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"dtype": "string", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "arabic_billion_words", "config_name": "Sabanews", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 198019614, "num_examples": 92149, "dataset_name": "arabic_billion_words"}}, "download_checksums": {"http://abuelkhair.net/corpus/Sabanews_XML_utf_8.rar": {"num_bytes": 38214558, "checksum": "c9b2f1ac8ed2a5e89ab9a6bcd82a0d825569b813b53cd83419968782e9946dbe"}}, "download_size": 38214558, "post_processing_size": null, "dataset_size": 198019614, "size_in_bytes": 236234172}, "SaudiYoum": {"description": "Abu El-Khair Corpus is an Arabic text corpus, that includes more than five million newspaper articles.\nIt contains over a billion and a half words in total, out of which, there are about three million unique words.\nThe corpus is encoded with two types of encoding, namely: UTF-8, and Windows CP-1256.\nAlso it was marked with two mark-up languages, namely: SGML, and XML.\n", "citation": "@article{el20161,\n title={1.5 billion words arabic corpus},\n author={El-Khair, Ibrahim Abu},\n journal={arXiv preprint arXiv:1611.04033},\n year={2016}\n}\n", "homepage": "http://abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus", "license": "", "features": {"url": {"dtype": "string", "id": null, "_type": "Value"}, "head_line": {"dtype": "string", "id": null, "_type": "Value"}, 
"date": {"dtype": "string", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "arabic_billion_words", "config_name": "SaudiYoum", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 2723291416, "num_examples": 888068, "dataset_name": "arabic_billion_words"}}, "download_checksums": {"http://abuelkhair.net/corpus/SaudiYoum_XML_utf_8.rar": {"num_bytes": 605537923, "checksum": "d4cbb5554acb03fb7ce271a0b708c1bc6bcf31593ae8c670bed7f8c22335a915"}}, "download_size": 605537923, "post_processing_size": null, "dataset_size": 2723291416, "size_in_bytes": 3328829339}, "Techreen": {"description": "Abu El-Khair Corpus is an Arabic text corpus, that includes more than five million newspaper articles.\nIt contains over a billion and a half words in total, out of which, there are about three million unique words.\nThe corpus is encoded with two types of encoding, namely: UTF-8, and Windows CP-1256.\nAlso it was marked with two mark-up languages, namely: SGML, and XML.\n", "citation": "@article{el20161,\n title={1.5 billion words arabic corpus},\n author={El-Khair, Ibrahim Abu},\n journal={arXiv preprint arXiv:1611.04033},\n year={2016}\n}\n", "homepage": "http://abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus", "license": "", "features": {"url": {"dtype": "string", "id": null, "_type": "Value"}, "head_line": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"dtype": "string", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "arabic_billion_words", "config_name": "Techreen", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 
1103458209, "num_examples": 314597, "dataset_name": "arabic_billion_words"}}, "download_checksums": {"http://abuelkhair.net/corpus/Techreen_XML_utf_8.rar": {"num_bytes": 252976781, "checksum": "5e4ab520399069fd38d9d80f4429fc05efaae51a912e1467becfc2686e424770"}}, "download_size": 252976781, "post_processing_size": null, "dataset_size": 1103458209, "size_in_bytes": 1356434990}, "Youm7": {"description": "Abu El-Khair Corpus is an Arabic text corpus, that includes more than five million newspaper articles.\nIt contains over a billion and a half words in total, out of which, there are about three million unique words.\nThe corpus is encoded with two types of encoding, namely: UTF-8, and Windows CP-1256.\nAlso it was marked with two mark-up languages, namely: SGML, and XML.\n", "citation": "@article{el20161,\n title={1.5 billion words arabic corpus},\n author={El-Khair, Ibrahim Abu},\n journal={arXiv preprint arXiv:1611.04033},\n year={2016}\n}\n", "homepage": "http://abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus", "license": "", "features": {"url": {"dtype": "string", "id": null, "_type": "Value"}, "head_line": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"dtype": "string", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "arabic_billion_words", "config_name": "Youm7", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 3004689464, "num_examples": 1172136, "dataset_name": "arabic_billion_words"}}, "download_checksums": {"http://abuelkhair.net/corpus/Youm7_XML_utf_8.rar": {"num_bytes": 617708074, "checksum": "cd81aa0b3d74e5d9a07377369ea473d8a7bd51cb5826e9809d700de2ddeffe23"}}, "download_size": 617708074, "post_processing_size": null, "dataset_size": 3004689464, "size_in_bytes": 3622397538}}
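
In the metadata above, each `size_in_bytes` value appears to equal `download_size` plus `dataset_size`. The sketch below checks that reading against the figures for all ten configurations. The name `SIZES` is hypothetical and the numbers are copied by hand from the JSON, so this verifies the arithmetic only; it is not a statement about how the `datasets` library computes the field.

```python
# (download_size, dataset_size, size_in_bytes) copied from dataset_infos.json above.
# SIZES is a hypothetical name used only for this illustration.
SIZES = {
    "Alittihad": (348259999, 1601790302, 1950050301),
    "Almasryalyoum": (242604438, 1056197870, 1298802308),
    "Almustaqbal": (350826797, 1545659336, 1896486133),
    "Alqabas": (595274646, 2631729746, 3227004392),
    "Echoroukonline": (108184378, 464386206, 572570584),
    "Ryiadh": (691264971, 3101294859, 3792559830),
    "Sabanews": (38214558, 198019614, 236234172),
    "SaudiYoum": (605537923, 2723291416, 3328829339),
    "Techreen": (252976781, 1103458209, 1356434990),
    "Youm7": (617708074, 3004689464, 3622397538),
}

# Configurations where the recorded total is not download + dataset size.
mismatches = [
    name
    for name, (download, dataset, total) in SIZES.items()
    if download + dataset != total
]

# Combined size of all ten archives, in bytes and in GiB.
total_download = sum(download for download, _, _ in SIZES.values())
total_download_gib = total_download / 2**30

print("mismatches:", mismatches)
print(f"total download: {total_download_gib:.2f} GiB")
```

No configuration deviates from the sum, and the ten archives together come to roughly 3.6 GiB of downloads.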

gitattributes.txt ADDED

	@@ -0,0 +1,27 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bin.* filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zstandard filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text