ccasimiro committed
Commit 47020bb · Parent(s): 1e644c6

Update README.md

Files changed (1): README.md +83 -25

README.md CHANGED
@@ -1,51 +1,72 @@
---
language:
- ca
- pipeline_tag: text-classification
license: apache-2.0
tags:
- "catalan"
- "semantic textual similarity"
- "sts-ca"
- "CaText"
- "Catalan Textual Corpus"
datasets:
- - "projecte-aina/sts-ca"
metrics:
- - "pearson"
model-index:
- name: roberta-base-ca-cased-sts
  results:
  - task:
      type: text-classification
    dataset:
      type: projecte-aina/sts-ca
-       name: sts-ca
    metrics:
-     - type: pearson
-       value: 0.7973
---

- # Catalan BERTa (RoBERTa-base) finetuned for Semantic Textual Similarity.

- The **roberta-base-ca-cased-sts** is a Semantic Textual Similarity (STS) model for the Catalan language fine-tuned from the [BERTa](https://huggingface.co/PlanTL-GOB-ES/roberta-base-ca) model, a [RoBERTa](https://arxiv.org/abs/1907.11692) base model pre-trained on a medium-size corpus collected from publicly available corpora and crawlers (check the BERTa model card for more details).

- ## Datasets
- We used the STS dataset in Catalan called [STS-ca](https://huggingface.co/datasets/projecte-aina/sts-ca) for training and evaluation.

- ## Evaluation and results
- We evaluated the _roberta-base-ca-cased-sts_ on the STS-ca test set against standard multilingual and monolingual baselines:

- | Model                     | STS-ca (Pearson) |
- |:--------------------------|:-----------------|
- | roberta-base-ca-cased-sts | **79.73**        |
- | mBERT                     | 76.34            |
- | XLM-RoBERTa               | 75.40            |
- | WikiBERT-ca               | 77.18            |

- For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/projecte-aina/club).

## How to use
To get the model's correct<sup>1</sup> prediction scores, with values between 0.0 and 5.0, use the following code:
@@ -70,20 +91,50 @@ sentence_pairs = [("El llibre va caure per la finestra.", "El llibre va sortir v
```python
predictions = pipe(prepare(sentence_pairs), add_special_tokens=False)

- # convert back to scores to the original 1 and 5 interval
for prediction in predictions:
    prediction['score'] = logit(prediction['score'])
print(predictions)
```
Expected output:
```
- [{'label': 'SIMILARITY', 'score': 2.4280577200108384},
- {'label': 'SIMILARITY', 'score': 2.132843521240822},
- {'label': 'SIMILARITY', 'score': 1.615101695426227}]
```

<sup>1</sup> _**avoid using the widget** scores since they are normalized and do not reflect the original annotation values._
- ## Citing
If you use any of these resources (datasets or models) in your work, please cite our latest paper:
```bibtex
@inproceedings{armengol-estape-etal-2021-multilingual,
@@ -106,3 +157,10 @@ If you use any of these resources (datasets or models) in your work, please cite
    pages = "4933--4946",
}
```
---
+ pipeline_tag: text-classification
language:
- ca
license: apache-2.0
tags:
- "catalan"
- "semantic textual similarity"
- "sts-ca"
- "CaText"
- "Catalan Textual Corpus"
datasets:
+ - "projecte-aina/sts-ca"
metrics:
+ - "combined_score"
model-index:
- name: roberta-base-ca-cased-sts
  results:
  - task:
      type: text-classification
    dataset:
      type: projecte-aina/sts-ca
+       name: STS-ca
    metrics:
+     - name: Pearson
+       type: Pearson
+       value: 0.797
---

+ # Catalan BERTa (roberta-base-ca) fine-tuned for Semantic Textual Similarity
+ ## Table of Contents
+ - [Model description](#model-description)
+ - [Intended Uses and Limitations](#intended-uses-and-limitations)
+ - [How to use](#how-to-use)
+ - [Training](#training)
+   - [Training data](#training-data)
+   - [Training Procedure](#training-procedure)
+ - [Evaluation](#evaluation)
+   - [Variables and Metrics](#variables-and-metrics)
+   - [Evaluation results](#evaluation-results)
+ - [Licensing Information](#licensing-information)
+ - [Citation Information](#citation-information)
+ - [Funding](#funding)
+ - [Contributions](#contributions)

+ ## Model description

+ **roberta-base-ca-cased-sts** is a Semantic Textual Similarity (STS) model for Catalan, fine-tuned from the roberta-base-ca model, a [RoBERTa](https://arxiv.org/abs/1907.11692) base model pre-trained on a medium-sized corpus collected from publicly available corpora and crawlers (see the roberta-base-ca model card for more details).
 
+ ## Intended Uses and Limitations

+ The **roberta-base-ca-cased-sts** model can be used to assess the similarity between two snippets of text. The model is limited by its training dataset and may not generalize well to all use cases.

## How to use
To get the model's correct<sup>1</sup> prediction scores, with values between 0.0 and 5.0, use the following code:
 
```python
predictions = pipe(prepare(sentence_pairs), add_special_tokens=False)

+ # convert the scores back to the original 0 to 5 interval
for prediction in predictions:
    prediction['score'] = logit(prediction['score'])
print(predictions)
```
Expected output:
```
+ [{'label': 'SIMILARITY', 'score': 2.118301674983813},
+ {'label': 'SIMILARITY', 'score': 2.1799755855125853},
+ {'label': 'SIMILARITY', 'score': 0.9511617858568939}]
```

<sup>1</sup> _**avoid using the widget** scores since they are normalized and do not reflect the original annotation values._
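The diff shows only the tail of this snippet, so for reference here is a minimal, self-contained sketch of the full usage flow. The `prepare` helper, the example sentence pair, and the hub id `projecte-aina/roberta-base-ca-cased-sts` are assumptions reconstructed from the visible fragments, not a verbatim copy of the card:

```python
# Hedged reconstruction of the "How to use" snippet; prepare(), the model id,
# and the example pair below are assumptions, not verbatim card content.
from transformers import pipeline, AutoTokenizer
from scipy.special import logit  # inverse of the sigmoid the pipeline applies

model_id = 'projecte-aina/roberta-base-ca-cased-sts'  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline('text-classification', model=model_id, tokenizer=tokenizer)

def prepare(sentence_pairs):
    # Join each pair with the tokenizer's own special tokens; this is why the
    # pipeline is later called with add_special_tokens=False.
    return [f"{tokenizer.cls_token} {s1}{tokenizer.sep_token}{tokenizer.sep_token} {s2}{tokenizer.sep_token}"
            for s1, s2 in sentence_pairs]

# Hypothetical pair (the card's own pairs are truncated in this diff).
sentence_pairs = [("El cel és blau.", "El cel té un color blau.")]

predictions = pipe(prepare(sentence_pairs), add_special_tokens=False)
for prediction in predictions:
    # The pipeline squashes the regression head's output through a sigmoid;
    # logit() undoes that, recovering the original 0-5 similarity scale.
    prediction['score'] = logit(prediction['score'])
print(predictions)
```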
+ ## Training

+ ### Training data
+ We used the STS dataset in Catalan called [STS-ca](https://huggingface.co/datasets/projecte-aina/sts-ca) for training and evaluation.

+ ### Training Procedure
+ The model was trained with a batch size of 16 and a learning rate of 5e-5 for 5 epochs. We then selected the best checkpoint using the downstream task metric on the corresponding development set, and evaluated it on the test set.
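As a rough illustration only, the schedule described above maps onto Hugging Face `TrainingArguments` along these lines; this is a sketch under the stated hyperparameters, not the project's actual script (see the CLUB repository linked under the evaluation results for that), and the metric name is assumed:

```python
# Illustrative sketch: batch size 16, learning rate 5e-5, 5 epochs, and
# best-checkpoint selection by the dev-set metric, as stated above.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="roberta-base-ca-cased-sts",
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=5,
    evaluation_strategy="epoch",             # score the dev set every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,             # keep the best dev checkpoint
    metric_for_best_model="combined_score",  # assumed metric name
)
```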
+ ## Evaluation

+ ### Variables and Metrics

+ This model was fine-tuned by maximizing the average of the Pearson and Spearman correlations.
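Concretely, that objective is the mean of the two correlation coefficients between predicted and gold similarity scores; a minimal sketch (the function name is ours):

```python
# Combined score used during fine-tuning: the average of the Pearson and
# Spearman correlations between predictions and gold labels.
from scipy.stats import pearsonr, spearmanr

def combined_score(predictions, references):
    pearson, _ = pearsonr(predictions, references)
    spearman, _ = spearmanr(predictions, references)
    return (pearson + spearman) / 2.0
```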
+ ### Evaluation results
+ We evaluated the _roberta-base-ca-cased-sts_ on the STS-ca test set against standard multilingual and monolingual baselines:

+ | Model                     | STS-ca (Pearson score) |
+ |:--------------------------|:-----------------------|
+ | roberta-base-ca-cased-sts | 79.73                  |
+ | mBERT                     | 74.26                  |
+ | XLM-RoBERTa               | 61.61                  |

+ For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/projecte-aina/club).

+ ## Licensing Information

+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

+ ## Citation Information
If you use any of these resources (datasets or models) in your work, please cite our latest paper:
```bibtex
@inproceedings{armengol-estape-etal-2021-multilingual,

    pages = "4933--4946",
}
```

+ ## Funding
+ This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).

+ ## Contributions

+ [N/A]