---
language:
- es
license: apache-2.0
datasets:
- oscar
---
# SELECTRA: A Spanish ELECTRA
SELECTRA is a Spanish pre-trained language model based on [ELECTRA](https://github.com/google-research/electra).
We release a `small` and `medium` version with the following configuration:
| Model | Layers | Embedding/Hidden Size | Params | Vocab Size | Max Sequence Length | Cased |
| --- | --- | --- | --- | --- | --- | --- |
| **SELECTRA small** | **12** | **256** | **22M** | **50k** | **512** | **True** |
| [SELECTRA medium](https://huggingface.co/Recognai/selectra_medium) | 12 | 384 | 41M | 50k | 512 | True |
**SELECTRA small is about 5 times smaller than BETO, and SELECTRA medium about 3 times smaller, yet both achieve comparable results** (see the Metrics section below).
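The size ratios quoted above follow from the parameter counts listed in the tables on this page (22M and 41M for SELECTRA, 110M for BETO). A minimal sketch of the arithmetic, with the counts hard-coded from those tables and a hypothetical helper name `size_ratio`:

```python
# Parameter counts in millions, copied from the tables on this page
PARAMS_M = {"SELECTRA small": 22, "SELECTRA medium": 41, "BETO": 110}


def size_ratio(reference: str, model: str) -> float:
    """How many times larger `reference` is than `model` (hypothetical helper)."""
    return PARAMS_M[reference] / PARAMS_M[model]


for name in ("SELECTRA small", "SELECTRA medium"):
    print(f"BETO is {size_ratio('BETO', name):.1f}x the size of {name}")
```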
## Usage
From the original [ELECTRA model card](https://huggingface.co/google/electra-small-discriminator): "ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a GAN."
The discriminator should therefore activate the logit corresponding to the fake input token, as the following example demonstrates:
```python
from transformers import ElectraForPreTraining, ElectraTokenizerFast

# Load the pre-trained discriminator and its tokenizer
discriminator = ElectraForPreTraining.from_pretrained("Recognai/selectra_small")
tokenizer = ElectraTokenizerFast.from_pretrained("Recognai/selectra_small")

# "rosa" is the out-of-place (fake) token in this sentence
sentence_with_fake_token = "Estamos desayunando pan rosa con tomate y aceite de oliva."

inputs = tokenizer.encode(sentence_with_fake_token, return_tensors="pt")
# One logit per input token; a higher value means "more likely fake"
logits = discriminator(inputs).logits.tolist()[0]

print("\t".join(tokenizer.tokenize(sentence_with_fake_token)))
# Skip the logits of the special tokens at the start and end of the sequence
print("\t".join(map(lambda x: str(x)[:4], logits[1:-1])))
"""Output:
Estamos desayun ##ando pan rosa con tomate y aceite de oliva .
-3.1 -3.6 -6.9 -3.0 0.19 -4.5 -3.3 -5.1 -5.7 -7.7 -4.4 -4.2
"""
```
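To turn the raw logits into per-token decisions, a common approach is to apply a sigmoid and flag every token whose probability exceeds 0.5 (equivalently, whose logit is positive). The sketch below reuses the tokens and rounded logits from the example output above. The 0.5 threshold and the helper name `flag_fake_tokens` are assumptions for illustration, not part of the model's API.

```python
import math

# Tokens and (rounded) logits copied from the example output above
tokens = ["Estamos", "desayun", "##ando", "pan", "rosa", "con",
          "tomate", "y", "aceite", "de", "oliva", "."]
logits = [-3.1, -3.6, -6.9, -3.0, 0.19, -4.5, -3.3, -5.1, -5.7, -7.7, -4.4, -4.2]


def sigmoid(x: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def flag_fake_tokens(tokens, logits, threshold=0.5):
    """Return (token, probability) pairs whose fake-probability exceeds `threshold`."""
    scored = [(tok, sigmoid(logit)) for tok, logit in zip(tokens, logits)]
    return [(tok, prob) for tok, prob in scored if prob > threshold]


for token, prob in flag_fake_tokens(tokens, logits):
    print(f"{token}: {prob:.2f}")
```

With these values, only `rosa` is flagged, and only by a small margin, since its logit is barely positive.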
In practice, you will most likely want to fine-tune this model on a downstream task.
We provide models fine-tuned on the [XNLI dataset](https://huggingface.co/datasets/xnli), which can be used together with the zero-shot classification pipeline:
- [Zero-shot SELECTRA small](https://huggingface.co/Recognai/zeroshot_selectra_small)
- [Zero-shot SELECTRA medium](https://huggingface.co/Recognai/zeroshot_selectra_medium)
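A minimal sketch of how one of these checkpoints might be plugged into the `transformers` zero-shot classification pipeline. The Spanish hypothesis template, the example sentence, and the candidate labels are illustrative assumptions, and the helper name `classify_es` is hypothetical.

```python
from typing import Dict, List

# Assumption: a Spanish hypothesis template, since the pipeline's default is English
HYPOTHESIS_TEMPLATE = "Este ejemplo es {}."


def classify_es(text: str, labels: List[str],
                model_name: str = "Recognai/zeroshot_selectra_small") -> Dict:
    """Score `text` against `labels` with the zero-shot classification pipeline."""
    # Imported lazily so that defining the helper does not require transformers
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification", model=model_name)
    return classifier(text, candidate_labels=labels,
                      hypothesis_template=HYPOTHESIS_TEMPLATE)


# Example call (downloads the model on first use):
# result = classify_es("El equipo ganó el partido en el último minuto.",
#                      ["deportes", "economia", "cultura"])
# print(result["labels"][0], result["scores"][0])
```

The pipeline returns the candidate labels sorted by score, so `result["labels"][0]` is the top prediction.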
## Metrics
We fine-tune our models on three downstream tasks:
- [XNLI](https://huggingface.co/datasets/xnli)
- [PAWS-X](https://huggingface.co/datasets/paws-x)
- [CoNLL2002 - NER](https://huggingface.co/datasets/conll2002)
For each task, we conduct 5 trials and report the mean and standard deviation of the metrics in the table below.
To compare our results to other Spanish language models, we provide the same metrics taken from the [evaluation table](https://github.com/PlanTL-SANIDAD/lm-spanish#evaluation-) of the [Spanish Language Model](https://github.com/PlanTL-SANIDAD/lm-spanish) repo.
| Model | CoNLL2002 - NER (f1) | PAWS-X (acc) | XNLI (acc) | Params |
| --- | --- | --- | --- | --- |
| SELECTRA small | 0.865 ± 0.004 | 0.896 ± 0.002 | 0.784 ± 0.002 | 22M |
| SELECTRA medium | 0.873 ± 0.003 | 0.896 ± 0.002 | 0.804 ± 0.002 | 41M |
| | | | | |
| [mBERT](https://huggingface.co/bert-base-multilingual-cased) | 0.8691 | 0.8955 | 0.7876 | 178M |
| [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) | 0.8759 | 0.9000 | 0.8130 | 110M |
| [RoBERTa-b](https://huggingface.co/BSC-TeMU/roberta-base-bne) | 0.8851 | 0.9000 | 0.8016 | 125M |
| [RoBERTa-l](https://huggingface.co/BSC-TeMU/roberta-large-bne) | 0.8772 | 0.9060 | 0.7958 | 355M |
| [Bertin](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/v1-512) | 0.8835 | 0.8990 | 0.7890 | 125M |
| [ELECTRICIDAD](https://huggingface.co/mrm8488/electricidad-base-discriminator) | 0.7954 | 0.9025 | 0.7878 | 109M |
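The mean and standard deviation entries for the SELECTRA rows can be derived from per-trial scores along the following lines. This is a sketch only: the trial scores are made-up placeholders, and whether the sample or the population standard deviation was used is an assumption (the sample version is shown).

```python
import statistics


def summarize(scores):
    """Format trial scores as 'mean ± std' with three decimals."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (assumption)
    return f"{mean:.3f} ± {std:.3f}"


# Hypothetical accuracies from 5 trials, NOT actual SELECTRA results
trial_scores = [0.80, 0.82, 0.81, 0.79, 0.83]
print(summarize(trial_scores))
```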
Some details of our fine-tuning runs:
- epochs: 5
- batch-size: 32
- learning rate: 1e-4
- warmup proportion: 0.1
- linear learning rate decay
- layerwise learning rate decay
For all the details, check out our [selectra repo](https://github.com/recognai/selectra).
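Layerwise learning rate decay assigns smaller learning rates to layers closer to the input. Below is a minimal sketch assuming a multiplicative decay per layer. The base learning rate of 1e-4 and the 12 layers come from this card, while the decay factor of 0.8 is a placeholder, since the value actually used is not stated here.

```python
def layerwise_learning_rates(base_lr: float, num_layers: int, decay: float):
    """Learning rate per encoder layer, index 0 being closest to the embeddings.

    The top layer trains at `base_lr`; each layer below it is scaled by
    `decay` one more time.
    """
    return [base_lr * decay ** (num_layers - 1 - layer) for layer in range(num_layers)]


# base_lr and num_layers are taken from this card; decay=0.8 is hypothetical
rates = layerwise_learning_rates(base_lr=1e-4, num_layers=12, decay=0.8)
for layer, lr in enumerate(rates):
    print(f"layer {layer:2d}: lr = {lr:.2e}")
```

The intent of such a scheme is that lower layers, which tend to hold more general features, move less during fine-tuning than the task-specific upper layers.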
## Training
We pre-trained our SELECTRA models on the Spanish portion of the [Oscar](https://huggingface.co/datasets/oscar) dataset, which is about 150GB in size.
Each model version is trained for 300k steps, with a warm restart of the learning rate after the first 150k steps.
Some details of the training:
- steps: 300k
- batch-size: 128
- learning rate: 5e-4
- warmup steps: 10k
- linear learning rate decay
- TPU cores: 8 (v2-8)
For all details, check out our [selectra repo](https://github.com/recognai/selectra).
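One plausible reading of this schedule, sketched below, is two back-to-back cycles of 150k steps, each with its own 10k-step linear warmup followed by linear decay to zero. Whether the second cycle repeats the warmup, and whether the rate decays exactly to zero, are assumptions; the selectra repo holds the authoritative configuration.

```python
def pretraining_lr(step: int, peak_lr: float = 5e-4, warmup_steps: int = 10_000,
                   cycle_steps: int = 150_000) -> float:
    """Linear warmup plus linear decay, restarted every `cycle_steps` steps."""
    local_step = step % cycle_steps  # position inside the current cycle
    if local_step < warmup_steps:
        return peak_lr * (local_step / warmup_steps)
    # Decay linearly from peak_lr at the end of warmup to 0 at the end of the cycle
    remaining = cycle_steps - local_step
    return peak_lr * (remaining / (cycle_steps - warmup_steps))


for step in (0, 10_000, 80_000, 149_999, 150_000, 160_000):
    print(f"step {step:>7,}: lr = {pretraining_lr(step):.2e}")
```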
**Note:** Due to a misconfiguration in the pre-training scripts, the embeddings of vocabulary tokens containing an accent were not optimized. If you fine-tune this model on a downstream task, consider using a tokenizer that does not strip accents:
```python
from transformers import ElectraTokenizerFast
tokenizer = ElectraTokenizerFast.from_pretrained("Recognai/selectra_small", strip_accents=False)
```
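To see why this matters for Spanish, the sketch below mimics accent stripping the way BERT-style tokenizers typically implement it: decompose each character with Unicode NFD and drop the combining marks. It is a standalone illustration with a hypothetical helper, not the tokenizer's actual code path.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks (Unicode category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


for word in ("canción", "año", "está"):
    print(f"{word} -> {strip_accents(word)}")
```

With accents stripped, distinct words such as "está" and "esta" collapse to the same string, whereas `strip_accents=False` keeps them apart.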
## Motivation
Despite the abundance of excellent Spanish language models (BETO, BSC-BNE, Bertin, ELECTRICIDAD, etc.), we felt there was still a lack of distilled or compact Spanish language models, and a lack of comparisons between such models and their larger siblings.
## Acknowledgment
This research was supported by the Google TPU Research Cloud (TRC) program.
## Authors
- David Fidalgo ([GitHub](https://github.com/dcfidalgo))
- Javier Lopez ([GitHub](https://github.com/javispp))
- Daniel Vila ([GitHub](https://github.com/dvsrepo))
- Francisco Aranda ([GitHub](https://github.com/frascuchon))