---
language:
- es
license: cc-by-4.0
tags:
- anglicisms
- loanwords
- borrowing
- codeswitching
- flair
- token-classification
- sequence-tagger-model
datasets:
- coalas
widget:
- text: "Las fake news sobre la celebrity se reprodujeron por los 'mass media' en prime time."
- text: "Me gusta el cine noir y el anime."
- text: "Benching, estar en el banquillo de tu 'crush' mientras otro juega de titular."
- text: "Recetas de noviembre para el batch cooking."
- text: "Utilizaron técnicas de machine learning, big data o blockchain."
---

# anglicisms-spanish-flair-cs

This is a pretrained model for detecting unassimilated English lexical borrowings (a.k.a. anglicisms) in Spanish newswire. The model labels words of foreign origin (fundamentally from English) used in Spanish, such as *fake news*, *machine learning*, *smartwatch*, *influencer* or *streaming*.

The model is a BiLSTM-CRF model fed with [Transformer-based embeddings pretrained on codeswitched data](https://huggingface.co/sagorsarker/codeswitch-spaeng-lid-lince) along with subword embeddings (BPE and character embeddings). The model was trained on the [COALAS](https://github.com/lirondos/coalas/) corpus for the task of detecting lexical borrowings.

The model considers two labels:

* ``ENG``: for English lexical borrowings (*smartphone*, *online*, *podcast*)
* ``OTHER``: for lexical borrowings from any other language (*boutique*, *anime*, *umami*)

The model uses BIO encoding to account for multitoken borrowings.
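With BIO encoding, a multitoken borrowing like *machine learning* is tagged `B-ENG I-ENG`. A minimal sketch of decoding such tag sequences back into labeled spans (`bio_to_spans` is a hypothetical helper for illustration, not part of this model's API):

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags (B-ENG, I-ENG, B-OTHER, I-OTHER, O) into labeled spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any open one
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # Continuation of the currently open span
            current[1].append(token)
        else:
            # O tag (or stray I-) closes any open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

tokens = ["Usaron", "machine", "learning", "y", "big", "data"]
tags = ["O", "B-ENG", "I-ENG", "O", "B-ENG", "I-ENG"]
print(bio_to_spans(tokens, tags))  # [('ENG', 'machine learning'), ('ENG', 'big data')]
```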

## Metrics (on the test set)

| LABEL | Precision | Recall | F1    |
|:-----:|:---------:|:------:|:-----:|
| ALL   | 90.14     | 81.79  | 85.76 |
| ENG   | 90.16     | 84.34  | 87.16 |
| OTHER | 85.71     | 13.04  | 22.64 |

There is another [mBERT-based model](https://huggingface.co/lirondos/anglicisms-spanish-mbert) for this same task trained with the ``Transformers`` library. That model, however, produced worse results than this Flair-based model (F1 = 83.55).
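The F1 column is the harmonic mean of the precision and recall columns, so it can be recomputed from the reported figures (rounding of the published numbers aside):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Recomputed from the precision/recall reported in the table above
print(round(f1(90.14, 81.79), 2))  # ALL   -> 85.76
print(round(f1(85.71, 13.04), 2))  # OTHER -> 22.64
```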

## Dataset

This model was trained on [COALAS](https://github.com/lirondos/coalas/), a corpus of Spanish newswire annotated with unassimilated lexical borrowings. The corpus contains 370,000 tokens and covers a variety of written media in European Spanish. The test set was designed to be as difficult as possible: it covers sources and dates not seen in the training set, includes a high number of OOV words (92% of the borrowings in the test set are OOV) and is very borrowing-dense (20 borrowings per 1,000 tokens).

| Set         | Tokens  | ENG   | OTHER | Unique |
|:-----------:|:-------:|:-----:|:-----:|:------:|
| Training    | 231,126 | 1,493 | 28    | 380    |
| Development | 82,578  | 306   | 49    | 316    |
| Test        | 58,997  | 1,239 | 46    | 987    |
| Total       | 372,701 | 3,038 | 123   | 1,683  |
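As a quick sanity check, the per-split counts above add up to the reported totals, and counting ENG + OTHER tokens per 1,000 tokens in the test set gives roughly 22, in line with the borrowing-density figure quoted above (the quoted 20-per-1,000 figure may count multitoken borrowings as single spans rather than tokens):

```python
# Split statistics copied from the dataset table above
splits = {
    "Training":    {"tokens": 231126, "eng": 1493, "other": 28},
    "Development": {"tokens": 82578,  "eng": 306,  "other": 49},
    "Test":        {"tokens": 58997,  "eng": 1239, "other": 46},
}

total_tokens = sum(s["tokens"] for s in splits.values())
total_eng = sum(s["eng"] for s in splits.values())
print(total_tokens, total_eng)  # 372701 3038

# Borrowing tokens per 1,000 tokens in the test set
test = splits["Test"]
density = (test["eng"] + test["other"]) / test["tokens"] * 1000
print(round(density, 1))  # 21.8
```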