edugp/kenlm

edugp committed on Jan 7, 2022

Commit 5868dfb • 1 Parent(s): 13451d2

Add metadata to model card

Files changed (1)

README.md

@@ -1,3 +1,41 @@
+---
+language:
+  - es
+  - af
+  - ar
+  - arz
+  - as
+  - bn
+  - fr
+  - sw
+  - eu
+  - ca
+  - zh
+  - en
+  - hi
+  - ur
+  - id
+  - pt
+  - vi
+  - gu
+  - kn
+  - ml
+  - mr
+  - ta
+  - te
+  - yo
+tags:
+- KenLM
+- Perplexity
+- n-gram
+- Kneser-Ney
+- BigScience
+license: "mit"
+datasets:
+- wikipedia
+- oscar
+---
 # KenLM models
 This repo contains several KenLM models trained on different tokenized datasets and languages.
 KenLM models are probabilistic n-gram language models. One use case for these models is fast perplexity estimation for [filtering or sampling large datasets](https://huggingface.co/bertin-project/bertin-roberta-base-spanish). For example, one could use a KenLM model trained on French Wikipedia to run inference on a large dataset and filter out samples that are very unlikely to appear on Wikipedia (high perplexity), or very simple, non-informative sentences that could appear repeatedly (low perplexity).
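
The filtering use case described in the README could look roughly like the sketch below. It is a minimal illustration, not code from this repo: `filter_by_perplexity`, `make_unigram_perplexity`, and the thresholds are hypothetical names and values, and a toy add-one-smoothed unigram model stands in for a KenLM model so the snippet runs without extra dependencies.

```python
import math
from collections import Counter
from typing import Callable, Iterable, List


def filter_by_perplexity(
    samples: Iterable[str],
    perplexity_fn: Callable[[str], float],
    low: float,
    high: float,
) -> List[str]:
    """Keep samples whose perplexity lies in [low, high].

    Very high perplexity suggests text unlike the training corpus;
    very low perplexity suggests trivial or repetitive text.
    """
    return [s for s in samples if low <= perplexity_fn(s) <= high]


def make_unigram_perplexity(corpus: Iterable[str]) -> Callable[[str], float]:
    """Toy stand-in for a KenLM model (assumption, for illustration only):
    an add-one-smoothed unigram model over whitespace tokens."""
    counts = Counter(tok for line in corpus for tok in line.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def perplexity(text: str) -> float:
        tokens = text.split()
        if not tokens:
            return float("inf")
        log_prob = sum(
            math.log((counts[t] + 1) / (total + vocab)) for t in tokens
        )
        return math.exp(-log_prob / len(tokens))

    return perplexity


corpus = ["the cat sat on the mat", "the dog sat on the rug"]
ppl = make_unigram_perplexity(corpus)
samples = ["the cat sat", "the the the", "xyzzy plugh"]
kept = filter_by_perplexity(samples, ppl, low=5.0, high=15.0)
print(kept)

# With the real library (assumptions: the `kenlm` Python bindings are
# installed and "path/to/model.arpa" is a hypothetical model file):
#   import kenlm
#   model = kenlm.Model("path/to/model.arpa")
#   kept = filter_by_perplexity(samples, model.perplexity, low, high)
```

Because the models in this repo are trained on tokenized text, input would presumably need the same tokenization before scoring, and sensible thresholds depend on the model and dataset.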