---
language:
- ace
- af
- als
- am
- an
- ang
- ar
- arz
- as
- ast
- av
- ay
- az
- azb
- ba
- bar
- bcl
- be
- bg
- bho
- bjn
- bn
- bo
- bpy
- br
- bs
- bxr
- ca
- cbk
- cdo
- ce
- ceb
- chr
- ckb
- co
- crh
- cs
- csb
- cv
- cy
- da
- de
- diq
- dsb
- dty
- dv
- egl
- el
- en
- eo
- es
- et
- eu
- ext
- fa
- fi
- fo
- fr
- frp
- fur
- fy
- ga
- gag
- gd
- gl
- glk
- gn
- gu
- gv
- ha
- hak
- he
- hi
- hif
- hr
- hsb
- ht
- hu
- hy
- ia
- id
- ie
- ig
- ilo
- io
- is
- it
- ja
- jam
- jbo
- jv
- ka
- kaa
- kab
- kbd
- kk
- km
- kn
- ko
- koi
- kok
- krc
- ksh
- ku
- kv
- kw
- ky
- la
- lad
- lb
- lez
- lg
- li
- lij
- lmo
- ln
- lo
- lrc
- lt
- ltg
- lv
- lzh
- mai
- map
- mdf
- mg
- mhr
- mi
- min
- mk
- ml
- mn
- mr
- mrj
- ms
- mt
- mwl
- my
- myv
- mzn
- nan
- nap
- nb
- nci
- nds
- ne
- new
- nl
- nn
- nrm
- nso
- nv
- oc
- olo
- om
- or
- os
- pa
- pag
- pam
- pap
- pcd
- pdc
- pfl
- pl
- pnb
- ps
- pt
- qu
- rm
- ro
- roa
- ru
- rue
- rup
- rw
- sa
- sah
- sc
- scn
- sco
- sd
- sgs
- sh
- si
- sk
- sl
- sme
- sn
- so
- sq
- sr
- srn
- stq
- su
- sv
- sw
- szl
- ta
- tcy
- te
- tet
- tg
- th
- tk
- tl
- tn
- to
- tr
- tt
- tyv
- udm
- ug
- uk
- ur
- uz
- vec
- vep
- vi
- vls
- vo
- vro
- wa
- war
- wo
- wuu
- xh
- xmf
- yi
- yo
- zea
- zh
- multilingual
license: apache-2.0
tags:
- Language Identification
datasets:
- wili_2018
metrics:
- accuracy
- macro F1-score
language_bcp47:
- be-tarask
- map-bms
- nds-nl
- roa-tara
- zh-yue
---
# Canine for Language Identification
A Canine model trained on the WiLI-2018 dataset to identify the language of a given text.
### Preprocessing
- 10% of the training data sampled as a validation set, stratified by label (see the split sketch below)
- max sequence length: 512
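The exact preprocessing script is not part of this card; the following is a minimal sketch, assuming the `datasets` library, the `sentence`/`label` column names of `wili_2018`, and `google/canine-c` as the base checkpoint:

```python
import datasets
from transformers import CanineTokenizer

# Load WiLI-2018 and carve out a stratified 10% validation split.
wili = datasets.load_dataset('wili_2018')
split = wili['train'].train_test_split(
    test_size=0.1,
    stratify_by_column='label',  # keep the label distribution equal in both splits
    seed=42,                     # arbitrary seed, for this sketch only
)
train_set, val_set = split['train'], split['test']

# Character-level tokenization, truncated to the max sequence length of 512.
tokenizer = CanineTokenizer.from_pretrained('google/canine-c')

def encode(batch):
    return tokenizer(batch['sentence'], truncation=True, max_length=512)

train_set = train_set.map(encode, batched=True)
val_set = val_set.map(encode, batched=True)
```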
### Hyperparameters
- epochs: 4
- learning rate: 3e-5
- batch size: 16
- gradient accumulation steps: 4
- optimizer: AdamW with default settings (a configuration sketch follows below)
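As a rough illustration, these values map onto `transformers.TrainingArguments` as sketched below; the output directory and evaluation strategy are assumptions, not taken from the original training script:

```python
from transformers import TrainingArguments

# Approximate training configuration; effective batch size is 16 * 4 = 64.
training_args = TrainingArguments(
    output_dir='canine-c-lang-id',       # hypothetical output directory
    num_train_epochs=4,
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    optim='adamw_torch',                 # AdamW with its default settings
    evaluation_strategy='epoch',         # assumption: evaluate once per epoch
)
```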
### Test Results
- Accuracy: 94.92%
- Macro F1-score: 94.91%
### Inference
Helper function that builds a dictionary mapping label ids to English language names:
```python
import datasets
import pycountry
def int_to_lang():
    dataset = datasets.load_dataset('wili_2018')
    # names for languages not in iso-639-3, taken from wikipedia
    non_iso_languages = {'roa-tara': 'Tarantino', 'zh-yue': 'Cantonese', 'map-bms': 'Banyumasan',
                         'nds-nl': 'Dutch Low Saxon', 'be-tarask': 'Belarusian'}
    # create dictionary from dataset label ids to language names
    lab_to_lang = {}
    for i, lang in enumerate(dataset['train'].features['label'].names):
        full_lang = pycountry.languages.get(alpha_3=lang)
        if full_lang:
            lab_to_lang[i] = full_lang.name
        else:
            lab_to_lang[i] = non_iso_languages[lang]
    return lab_to_lang
```
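A hedged end-to-end example that combines the model with the dictionary above; the Hub id `SebOchs/canine-c-lang-id` is assumed here and may need to be adjusted:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'SebOchs/canine-c-lang-id'  # assumed Hub id of this repository
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
lab_to_lang = int_to_lang()  # label id -> English language name (see above)

text = "Dies ist ein kurzer deutscher Beispielsatz."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits
pred_id = logits.argmax(dim=-1).item()
print(lab_to_lang[pred_id])  # e.g. 'German'
```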
### Credits
```
@article{clark-etal-2022-canine,
title = "Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation",
author = "Clark, Jonathan H. and
Garrette, Dan and
Turc, Iulia and
Wieting, John",
journal = "Transactions of the Association for Computational Linguistics",
volume = "10",
year = "2022",
address = "Cambridge, MA",
publisher = "MIT Press",
url = "https://aclanthology.org/2022.tacl-1.5",
doi = "10.1162/tacl_a_00448",
pages = "73--91",
abstract = "Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model{'}s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences{---}without explicit tokenization or vocabulary{---}and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.",
}
@dataset{thoma_martin_2018_841984,
author = {Thoma, Martin},
title = {{WiLI-2018 - Wikipedia Language Identification
database}},
month = jan,
year = 2018,
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.841984},
url = {https://doi.org/10.5281/zenodo.841984}
}
```