---
license: mit
datasets:
- mlengineer-ai/jomleh
language:
- fa
metrics:
- perplexity
tags:
- kneser-ney
- n-gram
- kenlm
---
# KenLM models for Farsi
This repository contains KenLM models for Farsi (Persian), trained on the Jomleh dataset. Among the many use cases for language models like KenLM, the models provided here are particularly useful for ASR (automatic speech recognition): they can be combined with CTC decoding to select the most likely sequence of tokens extracted from a spectrogram.
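To make the ASR use case concrete, here is a minimal rescoring sketch. It assumes the `kenlm` and `sentencepiece` packages described below; the candidate hypotheses, file names, and LM weight are all hypothetical placeholders, not part of this repository.

```python
import kenlm
import sentencepiece as spm

# Hypothetical setup: tokenizer and LM files as shipped in this repository
sp = spm.SentencePieceProcessor()
sp.load("files/jomleh-sp-32000.model")
lm = kenlm.Model("files/jomleh-sp-32000-o5-prune01111.probing")

# Hypothetical candidates from a CTC beam search: (text, acoustic log-probability)
candidates = [("متن نمونه یک", -12.3), ("متن نمونه دو", -11.9)]
alpha = 0.5  # LM weight, a tunable hyperparameter (value here is arbitrary)

def combined_score(text, acoustic_score):
    # KenLM scores a space-separated token sequence and returns a log10 probability
    tokens = " ".join(sp.encode_as_pieces(text))
    return acoustic_score + alpha * lm.score(tokens)

# Pick the hypothesis with the best combined acoustic + language model score
best_text, _ = max(candidates, key=lambda c: combined_score(*c))
print(best_text)
```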
The models in this repository are KenLM ARPA files converted to binary. KenLM supports two binary formats: probing and trie. The models provided here use the probing format, which KenLM claims is faster but has a bigger memory footprint.
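For reference, this kind of conversion is done with KenLM's `build_binary` tool; the file names below are only placeholders, since the binaries are already provided in this repository.

```bash
# Convert an ARPA file into KenLM's probing binary format (file names are placeholders)
build_binary probing model.arpa model.probing
```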
There are a total of 36 different KenLM models that you can find here. Unless you are doing some research, you won't need all of them. In that case, I suggest downloading only the ones you need rather than the whole repository, since the total size of the files is larger than half a terabyte.
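One way to fetch individual files is the `huggingface_hub` helper shown below; this is only a sketch, and the `repo_id` is a placeholder that you would replace with this repository's actual id.

```python
from huggingface_hub import hf_hub_download

# Download a single model file instead of cloning the whole (500+ GB) repository.
# The repo_id below is a placeholder; replace it with this repository's id.
path = hf_hub_download(
    repo_id="<this-repo-id>",
    filename="files/jomleh-sp-32000-o5-prune01111.probing",
)
print(path)  # local path to the cached file
```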
## Sample code showing how to use the models
Unfortunately, I could not find an easy way to load the models through the Hugging Face library. These are the steps you have to take to use any of the models provided here:
- Install the KenLM package:

  ```bash
  pip install https://github.com/kpu/kenlm/archive/master.zip
  ```
- Install SentencePiece for tokenization:

  ```bash
  pip install sentencepiece
  ```
- Download the model that you are interested in from this repository along with the Python code `model.py`. Keep the model in the `files` folder next to `model.py` (just like the file structure in the repository). Don't forget to download the SentencePiece files as well. For instance, if you were interested in the 32000-token vocabulary tokenizer and the 5-gram model with maximum pruning, these are the files you'll need:

  ```
  model.py
  files/jomleh-sp-32000.model
  files/jomleh-sp-32000.vocab
  files/jomleh-sp-32000-o5-prune01111.probing
  ```
- Write your own code to instantiate a model and use it:
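The following is a minimal sketch that uses the `kenlm` and `sentencepiece` Python packages directly; the bundled `model.py` presumably wraps similar logic. The file names follow the 32000-vocabulary, 5-gram example above, and the Farsi sentence is just a placeholder.

```python
import kenlm
import sentencepiece as spm

# Load the SentencePiece tokenizer and the KenLM binary (probing format)
sp = spm.SentencePieceProcessor()
sp.load("files/jomleh-sp-32000.model")
lm = kenlm.Model("files/jomleh-sp-32000-o5-prune01111.probing")

sentence = "این یک جمله فارسی است"  # placeholder Farsi sentence
tokens = " ".join(sp.encode_as_pieces(sentence))

# KenLM scores a space-separated token sequence; score() returns a log10 probability
print(lm.score(tokens))
print(lm.perplexity(tokens))
```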
## What are the different models provided here?
There are a total of 36 models in this repository. While all of them are trained on the Jomleh dataset, which is a Farsi dataset, there are differences among them, namely:
- Different vocabulary sizes: For research purposes, I trained tokenizers with 6 different vocabulary sizes. Of course, the vocabulary size is a hyperparameter of the tokenizer (SentencePiece here), but each new tokenizer results in a new language model. The vocabulary sizes used here are: 2000, 4000, 8000, 16000, 32000, and 57218 tokens. For most use cases, either the 32000- or 57218-token vocabulary should be the best option.