ccasimiro committed on
Commit
8b48c21
·
1 Parent(s): 59f67a7

update readme

  1. README.md +67 -10
README.md CHANGED
@@ -18,7 +18,7 @@ tags:
18
 
19
  datasets:
20
 
21
- - "projecte-aina/tecla"
22
 
23
  metrics:
24
  - accuracy
@@ -29,13 +29,12 @@ model-index:
29
  - task:
30
  type: text-classification
31
  dataset:
32
- name: tecla
33
  type: projecte-aina/tecla
34
  metrics:
35
  - name: Accuracy
36
  type: accuracy
37
  value: 0.740388810634613
38
-
39
  widget:
40
 
41
  - text: "Els Pets presenten el seu nou treball al Palau Sant Jordi."
@@ -50,14 +49,60 @@ widget:
50
 
51
  ---
52
 
53
- # Catalan BERTa (RoBERTa-base) finetuned for Text Classification.
54
 
55
- The **roberta-base-ca-cased-tc** is a Text Classification (TC) model for the Catalan language fine-tuned from the [BERTa](https://huggingface.co/PlanTL-GOB-ES/roberta-base-ca) model, a [RoBERTa](https://arxiv.org/abs/1907.11692) base model pre-trained on a medium-size corpus collected from publicly available corpora and crawlers (check the BERTa model card for more details).
56
 
57
- ## Datasets
58
- We used the TC dataset in Catalan called [TeCla](https://huggingface.co/datasets/projecte-aina/viquiquad) for training and evaluation.
59
 
60
- ## Evaluation and results
61
  We evaluated the _roberta-base-ca-cased-tc_ on the TeCla test set against standard multilingual and monolingual baselines:
62
 
63
  | Model | TeCla (accuracy) |
@@ -69,7 +114,11 @@ We evaluated the _roberta-base-ca-cased-tc_ on the TeCla test set against standa
69
 
70
  For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/projecte-aina/club).
71
 
72
- ## Citing
73
  If you use any of these resources (datasets or models) in your work, please cite our latest paper:
74
  ```bibtex
75
  @inproceedings{armengol-estape-etal-2021-multilingual,
@@ -91,4 +140,12 @@ If you use any of these resources (datasets or models) in your work, please cite
91
  doi = "10.18653/v1/2021.findings-acl.437",
92
  pages = "4933--4946",
93
  }
94
- ```
 
18
 
19
  datasets:
20
 
21
+ - "projecte-aina/tecla"
22
 
23
  metrics:
24
  - accuracy
 
29
  - task:
30
  type: text-classification
31
  dataset:
32
+ name: TeCla
33
  type: projecte-aina/tecla
34
  metrics:
35
  - name: Accuracy
36
  type: accuracy
37
  value: 0.740388810634613
 
38
  widget:
39
 
40
  - text: "Els Pets presenten el seu nou treball al Palau Sant Jordi."
 
49
 
50
  ---
51
 
52
+ # Catalan BERTa (roberta-base-ca) fine-tuned for Text Classification
53
+
54
+ ## Table of Contents
55
+ - [Model Description](#model-description)
56
+ - [Intended Uses and Limitations](#intended-uses-and-limitations)
57
+ - [How to Use](#how-to-use)
58
+ - [Training](#training)
59
+ - [Training Data](#training-data)
60
+ - [Training Procedure](#training-procedure)
61
+ - [Evaluation](#evaluation)
62
+ - [Variable and Metrics](#variable-and-metrics)
63
+ - [Evaluation Results](#evaluation-results)
64
+ - [Licensing Information](#licensing-information)
65
+ - [Citation Information](#citation-information)
66
+ - [Funding](#funding)
67
+ - [Contributions](#contributions)
68
+
69
+ ## Model Description
70
+ The **roberta-base-ca-cased-tc** is a Text Classification (TC) model for the Catalan language fine-tuned from the roberta-base-ca model, a [RoBERTa](https://arxiv.org/abs/1907.11692) base model pre-trained on a medium-size corpus collected from publicly available corpora and crawlers.
71
+
72
+ ## Intended Uses and Limitations
73
+
74
+ The **roberta-base-ca-cased-tc** model can be used to classify texts. The model is limited by its training dataset and may not generalize well to all use cases.
75
+
76
+ ## How to Use
77
+
78
+ Here is how to use this model:
79
+
80
+ ```python
81
+ from transformers import pipeline
82
+ from pprint import pprint
83
+
84
+ nlp = pipeline("text-classification", model="projecte-aina/roberta-base-ca-cased-tc")
85
+ example = "Retards a quatre línies de Rodalies per una avaria entre Sants i plaça de Catalunya."
86
+
87
+ tc_results = nlp(example)
88
+ pprint(tc_results)
89
+ ```
90
 
91
+ ## Training
92
 
93
+ ### Training Data
94
+ We used the TC dataset in Catalan called [TeCla](https://huggingface.co/datasets/projecte-aina/tecla) for training and evaluation.
95
 
96
+ ### Training Procedure
97
+ The model was trained with a batch size of 16 and a learning rate of 5e-5 for 5 epochs. We then selected the best checkpoint using the downstream task metric on the corresponding development set and evaluated it on the test set.
98
+
99
+ ## Evaluation
100
+
101
+ ### Variable and Metrics
102
+
103
+ This model was fine-tuned to maximize accuracy.
104
+
105
+ ### Evaluation Results
106
  We evaluated the _roberta-base-ca-cased-tc_ on the TeCla test set against standard multilingual and monolingual baselines:
107
 
108
  | Model | TeCla (accuracy) |
 
114
 
115
  For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/projecte-aina/club).
116
 
117
+ ## Licensing Information
118
+
119
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
120
+
121
+ ## Citation Information
122
  If you use any of these resources (datasets or models) in your work, please cite our latest paper:
123
  ```bibtex
124
  @inproceedings{armengol-estape-etal-2021-multilingual,
 
140
  doi = "10.18653/v1/2021.findings-acl.437",
141
  pages = "4933--4946",
142
  }
143
+ ```
144
+
145
+ ## Funding
146
+ This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
147
+
148
+
149
+ ## Contributions
150
+
151
+ [N/A]
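
Note on the Training Procedure text added in this commit: it says the best checkpoint was selected using the downstream task metric (accuracy) on the development set. The snippet below is a minimal sketch of that selection step, not the project's actual script (the real fine-tuning and evaluation code lives in the linked GitHub repository). The function names `accuracy` and `select_best_checkpoint`, the checkpoint names, and the toy labels are all hypothetical and only serve to illustrate the idea.

```python
from typing import Dict, Sequence, Tuple


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Return the fraction of predictions that match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, labels))
    return correct / len(labels)


def select_best_checkpoint(
    dev_predictions: Dict[str, Sequence[str]],
    dev_labels: Sequence[str],
) -> Tuple[str, float]:
    """Pick the checkpoint with the highest development-set accuracy.

    Ties go to the checkpoint that appears first in the dictionary.
    """
    scores = {
        name: accuracy(preds, dev_labels)
        for name, preds in dev_predictions.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical development-set labels and per-epoch predictions.
dev_labels = ["A", "B", "A", "B"]
dev_predictions = {
    "epoch-1": ["A", "A", "A", "A"],
    "epoch-2": ["A", "B", "A", "A"],
    "epoch-3": ["A", "B", "B", "A"],
}

best, score = select_best_checkpoint(dev_predictions, dev_labels)
print(f"{best} {score:.2f}")  # prints "epoch-2 0.75"
```

Only the selected checkpoint would then be scored on the test set, which keeps the test data out of the model-selection loop.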