bsc-temu committed on
Commit c1c2dea
1 Parent(s): 7903afe

Update README.md

README.md CHANGED
@@ -0,0 +1,95 @@
---
language:
- ca
license: ???
tags:
- "catalan"
- "semantic textual similarity"
- "sts-ca"
- "CaText"
- "Catalan Textual Corpus"
datasets:
- "projecte-aina/sts-ca"
metrics:
- "pearson"
model-index:
- name: roberta-base-ca-cased-sts
  results:
  - task:
      type: text-classification
    dataset:
      type: projecte-aina/sts-ca
      name: sts-ca
    metrics:
    - type: pearson
      value: 0.8120486139447483
widget:
- text: "M'agrades. T'estimo."
- text: "M'agrada el sol i la calor. A la Garrotxa plou molt."
- text: "El llibre va caure per la finestra. El llibre va sortir volant."
- text: "El meu aniversari és el 23 de maig. Faré anys a finals de maig."
---

# Catalan BERTa (RoBERTa-base) finetuned for Semantic Textual Similarity

**roberta-base-ca-cased-sts** is a Semantic Textual Similarity (STS) model for the Catalan language, fine-tuned from [BERTa](https://huggingface.co/PlanTL-GOB-ES/roberta-base-ca), a [RoBERTa](https://arxiv.org/abs/1907.11692) base model pre-trained on a medium-sized corpus collected from publicly available corpora and crawlers (see the BERTa model card for more details).
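
As a quick orientation, the model can be queried with the `transformers` library. The snippet below is only a sketch: the Hub identifier `projecte-aina/roberta-base-ca-cased-sts` and the single-output regression head are assumptions inferred from the model name and the Pearson metric above, not details confirmed by this card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed Hub identifier; adjust if the model is hosted under another name.
model_id = "projecte-aina/roberta-base-ca-cased-sts"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Encode the two sentences as one pair, the standard input format for STS heads.
inputs = tokenizer(
    "El llibre va caure per la finestra.",
    "El llibre va sortir volant.",
    return_tensors="pt",
)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
print(score)  # raw regression output; STS annotations typically range from 0 to 5
```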

## Datasets
We used the Catalan STS dataset [STS-ca](https://huggingface.co/datasets/projecte-aina/sts-ca) for training and evaluation.
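
The dataset can be pulled straight from the Hub with the `datasets` library; a minimal sketch:

```python
from datasets import load_dataset

# Load the STS-ca dataset (identifier taken from the metadata above).
sts_ca = load_dataset("projecte-aina/sts-ca")
print(sts_ca)  # inspect the available splits and fields
```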

## Evaluation and results
The evaluation result on the STS-ca test set is shown below:

| Model | STS-ca (Pearson) |
| ----- | :--------------- |
| BERTa | **81.20** |

For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/projecte-aina/berta).
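
For reference, the figure above is a Pearson correlation (multiplied by 100, matching the metadata value of 0.8120) between the model's predicted similarity scores and the gold annotations. A minimal sketch of how such a score is computed, with placeholder numbers rather than real STS-ca data:

```python
from scipy.stats import pearsonr

gold = [4.5, 1.0, 3.2, 0.5]         # hypothetical gold similarity scores
predictions = [4.1, 1.3, 2.9, 0.8]  # hypothetical model outputs

r, _ = pearsonr(gold, predictions)
print(f"Pearson r = {100 * r:.2f}")  # reported on the x100 scale used in the table
```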

## Citing
If you use any of these resources (datasets or models) in your work, please cite our latest paper:
```bibtex
@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi and
      Carrino, Casimiro Pio and
      Rodriguez-Penagos, Carlos and
      de Gibert Bonet, Ona and
      Armentano-Oller, Carme and
      Gonzalez-Agirre, Aitor and
      Melero, Maite and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}
```
92
+ ## Funding
93
+ TODO
94
+ ## Disclaimer
95
+ TODO