naist-nlp/tubelex-kenlm


adno committed on Oct 4, 2024

Commit acf58cd · verified · 1 Parent(s): 5f6a056

Update README.md


README.md +47 -3


@@ -1,3 +1,47 @@
----
-license: bsd-3-clause
----

+---
+license: bsd-3-clause
+language:
+- zh
+- en
+- id
+- ja
+- es
+---
+# TUBELEX Statistical Language Models
+N-gram language models trained on the TUBELEX YouTube subtitle corpora. We provide modified Kneser-Ney language models of order 5 ([Heafield et al., 2013](https://aclanthology.org/P13-2121)), i.e. [KenLM](https://kheafield.com/code/kenlm/) models.
+The files are in LZMA-compressed ARPA format.
+# What is TUBELEX?
+TUBELEX is a YouTube subtitle corpus currently available for Chinese, English, Indonesian, Japanese, and Spanish.
+- TODO: paper link
+- [fastText word embeddings](https://huggingface.co/naist-nlp/tubelex-fasttext)
+- [word frequencies and code](https://github.com/naist-nlp/tubelex)
+# Usage
+To download and use the KenLM models in Python, first install dependencies:
+```
+pip install huggingface_hub
+pip install https://github.com/kpu/kenlm/archive/master.zip
+```
+You can then use e.g. the English (`en`) model in the following way:
+```
+import kenlm
+from huggingface_hub import hf_hub_download
+model_file = hf_hub_download(repo_id='naist-nlp/tubelex-kenlm', filename='tubelex-en.arpa.xz')
+# Loading the model requires KenLM to be compiled with LZMA support (`HAVE_XZLIB`).
+# Otherwise you will first need to decompress the model.
+model = kenlm.Model(model_file)
+text = 'a sequence of words'  # pre-tokenized, lower-cased, without punctuation
+model.perplexity(text)
+```
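
The README notes that a KenLM build without LZMA support cannot read the `.arpa.xz` files directly, so the model has to be decompressed first. One way to do that with only the Python standard library is sketched below. The helper name `decompress_xz` and its parameters are hypothetical and are not part of KenLM, `huggingface_hub`, or the TUBELEX code.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(src, dst=None, chunk_size=1024 * 1024):
    """Decompress an .xz file and return the path of the decompressed copy.

    Hypothetical helper for illustration. If `dst` is omitted, the output
    is written next to `src` with the trailing `.xz` removed.
    """
    src = Path(src)
    if dst is None:
        if src.suffix != '.xz':
            raise ValueError(f'cannot infer an output name for {src}')
        dst = src.with_suffix('')  # e.g. tubelex-en.arpa.xz -> tubelex-en.arpa
    dst = Path(dst)
    # Copy in chunks so the decompressed ARPA file never has to fit in memory.
    with lzma.open(src, 'rb') as fin, open(dst, 'wb') as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst
```

With this in place, the loading step could become `kenlm.Model(str(decompress_xz(model_file, 'tubelex-en.arpa')))`. Passing an explicit `dst` keeps the decompressed file out of the Hugging Face cache directory, which is probably preferable to writing next to the cached download.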