adno committed on
Commit
acf58cd
·
verified ·
1 Parent(s): 5f6a056

Update README.md

  1. README.md +47 -3
README.md CHANGED
@@ -1,3 +1,47 @@
- ---
- license: bsd-3-clause
- ---
+ ---
+ license: bsd-3-clause
+ language:
+ - zh
+ - en
+ - id
+ - ja
+ - es
+ ---
+
+ # TUBELEX Statistical Language Models
+
+ N-gram models on the TUBELEX YouTube subtitle corpora. We provide modified Kneser-Ney language models of order 5 ([Heafield et al., 2013](https://aclanthology.org/P13-2121)), i.e. [KenLM](https://kheafield.com/code/kenlm/) models.
+
+ The files are in LZMA-compressed ARPA format.
+
+ # What is TUBELEX?
+
+ TUBELEX is a YouTube subtitle corpus currently available for Chinese, English, Indonesian, Japanese, and Spanish.
+
+ - TODO: paper link
+ - [fastText word embeddings](https://huggingface.co/naist-nlp/tubelex-fasttext)
+ - [word frequencies and code](https://github.com/naist-nlp/tubelex)
+
+ # Usage
+
+ To download and use the KenLM models in Python, first install dependencies:
+
+ ```
+ pip install huggingface_hub
+ pip install https://github.com/kpu/kenlm/archive/master.zip
+ ```
+
+ You can then use e.g. the English (`en`) model in the following way:
+
+ ```
+ import kenlm
+ from huggingface_hub import hf_hub_download
+
+ model_file = hf_hub_download(repo_id='naist-nlp/tubelex-kenlm', filename='tubelex-en.arpa.xz')
+ # Loading the model requires KenLM to be compiled with LZMA support (`HAVE_XZLIB`).
+ # Otherwise you will first need to decompress the model.
+ model = kenlm.Model(model_file)
+
+ text = 'a sequence of words'  # pre-tokenized, lower-cased, without punctuation
+ model.perplexity(text)
+ ```
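
The README's code comment says the model must be decompressed first when KenLM was built without LZMA support, but does not show how. A minimal sketch of that step, using only the Python standard library, might look like the following. The helper name `decompress_xz` and its `out_dir` parameter are hypothetical and not part of the TUBELEX repository; the sketch also assumes the input file name ends in `.xz`.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(xz_path, out_dir='.'):
    """Decompress an .xz file into out_dir and return the output path.

    Hypothetical helper for illustration; not part of TUBELEX or KenLM.
    """
    xz_path = Path(xz_path)
    if xz_path.suffix != '.xz':
        raise ValueError(f'expected a file name ending in .xz, got {xz_path.name}')
    # 'tubelex-en.arpa.xz' -> 'tubelex-en.arpa'
    out_path = Path(out_dir) / xz_path.stem
    if not out_path.exists():
        # Stream in chunks so a large ARPA file is never held in memory at once.
        with lzma.open(xz_path, 'rb') as src, open(out_path, 'wb') as dst:
            shutil.copyfileobj(src, dst)
    return out_path
```

Under these assumptions, the path returned by `hf_hub_download` could be passed to `decompress_xz`, and the result handed to KenLM as `kenlm.Model(str(arpa_path))`. Writing to a separate `out_dir` rather than next to the downloaded file keeps the Hugging Face cache directory untouched.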