projecte-aina/roberta-base-ca-v2

@@ -127,22 +127,22 @@ It contains the following tasks and their related datasets:
  3. Text Classification (TC)
-    **[TeCla](https://doi.org/10.5281/zenodo.4627197)**: consisting of 137k news pieces from the Catalan News Agency ([ACN](https://www.acn.cat/)) corpus
  4. Semantic Textual Similarity (STS)
-    **[Catalan semantic textual similarity](https://doi.org/10.5281/zenodo.4529183)**: consisting of more than 3000 sentence pairs, annotated with the semantic similarity between them,
-    scraped from the [Catalan Textual Corpus](https://doi.org/10.5281/zenodo.4519349)
  5. Question Answering (QA):
-    **[ViquiQuAD](https://doi.org/10.5281/zenodo.4562344)**: consisting of more than 15,000 questions outsourced from Catalan Wikipedia randomly chosen from a set of 596 articles that were originally written in Catalan.
-    **[VilaQuAD](https://doi.org/10.5281/zenodo.4562337)**: contains 6,282 pairs of questions and answers, outsourced from 2095 Catalan language articles from VilaWeb newswire text.
-    **[CatalanQA](projecte-aina/catalanqa)**: an aggregation of 2 previous datasets (VilaQuAD and ViquiQuAD), 21,427 pairs of Q/A balanced by type of question, containing one question and one answer per context, although the contexts can repeat multiple times.
-    **[XQuAD](https://doi.org/10.5281/zenodo.4526223)**: the Catalan translation of XQuAD, a multilingual collection of manual translations of 1,190 question-answer pairs from English Wikipedia used only as a _test set_
 Here are the train/dev/test splits of the datasets:

  3. Text Classification (TC)
+    **[TeCla](https://huggingface.co/datasets/projecte-aina/tecla)**: consisting of 137k news pieces from the Catalan News Agency ([ACN](https://www.acn.cat/)) corpus, with 30 labels
  4. Semantic Textual Similarity (STS)
+    **[Catalan semantic textual similarity](https://huggingface.co/datasets/projecte-aina/sts-ca)**: consisting of more than 3000 sentence pairs, annotated with the semantic similarity between them,
+    scraped from the [Catalan Textual Corpus](https://huggingface.co/datasets/projecte-aina/catalan_textual_corpus)
  5. Question Answering (QA):
+    **[ViquiQuAD](https://huggingface.co/datasets/projecte-aina/viquiquad)**: consisting of more than 15,000 questions outsourced from Catalan Wikipedia randomly chosen from a set of 596 articles that were originally written in Catalan.
+    **[VilaQuAD](https://huggingface.co/datasets/projecte-aina/vilaquad)**: contains 6,282 pairs of questions and answers, outsourced from 2095 Catalan language articles from VilaWeb newswire text.
+    **[CatalanQA](https://huggingface.co/datasets/projecte-aina/catalanqa)**: an aggregation of 2 previous datasets (VilaQuAD and ViquiQuAD), 21,427 pairs of Q/A balanced by type of question, containing one question and one answer per context, although the contexts can repeat multiple times.
+    **[XQuAD](https://huggingface.co/datasets/projecte-aina/xquad-ca)**: the Catalan translation of XQuAD, a multilingual collection of manual translations of 1,190 question-answer pairs from English Wikipedia used only as a _test set_
 Here are the train/dev/test splits of the datasets: