---
license: mit
datasets:
- mlengineer-ai/jomleh
language:
- fa
metrics:
- perplexity
tags:
- kneser-ney
- n-gram
- kenlm
---

# KenLM models for Farsi

This repository contains KenLM models for the Farsi (Persian) language, trained on the Jomleh dataset. Among the many use cases for language models like KenLM, the models provided here are particularly useful for the ASR (automatic speech recognition) task. They can be used along with CTC to select the more likely sequence of tokens extracted from a spectrogram.
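
Purely as an illustration (the acoustic model, its label set, and its logits below are placeholders and not part of this repository), this is roughly how such a model could be plugged into CTC beam-search decoding with the pyctcdecode library, assuming the acoustic model's output units match the tokenization the language model was trained on:

```python
# Illustration only: CTC beam search with one of these KenLM binaries via the
# pyctcdecode library (pip install pyctcdecode). The label set and the logits
# are placeholders standing in for a real acoustic model.
import numpy as np
from pyctcdecode import build_ctcdecoder

labels = ["", " ", "ا", "ب", "ت"]  # placeholder: your CTC model's output units

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="files/jomleh-sp-32000-o5-prune01111.probing",
)

# Per-frame log-probabilities from the acoustic model: shape (time, len(labels)).
logits = np.log(np.random.dirichlet(np.ones(len(labels)), size=100)).astype(np.float32)

print(decoder.decode(logits))
```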

The models in this repository are KenLM ARPA files converted into binary form. KenLM supports two types of binary format: probing and trie. The models provided here use the probing format, which KenLM claims is faster but has a bigger memory footprint.
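
For example, any of the probing binaries can be loaded directly with the kenlm Python package (installation is covered in the steps below); the file name here is the 5-gram model used as an example later in this document:

```python
# Loading one of the probing-format binaries with the kenlm Python package
# and checking its n-gram order (5 for the 5-gram models).
import kenlm

lm = kenlm.Model("files/jomleh-sp-32000-o5-prune01111.probing")
print(lm.order)
```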

There are a total of 36 different KenLM models in this repository. Unless you are doing research, you won't need all of them. If that's the case, I suggest downloading only the ones you need rather than the whole repository, as the total size of the files is larger than half a terabyte.

# Sample code on how to use the models

Unfortunately, I could not find an easy way to integrate the Python code that loads these models with the Hugging Face library. These are the steps you have to take if you want to use any of the models provided here:

1. Install the KenLM package:

```
pip install https://github.com/kpu/kenlm/archive/master.zip
```

2. Install SentencePiece for tokenization:

```
pip install sentencepiece
```

3. Download the model that you are interested in from this repository, along with the Python code `model.py`. Keep the model in the `files` folder next to `model.py` (just like the file structure in the repository). Don't forget to download the SentencePiece files as well. For instance, if you were interested in the 32000-token vocabulary tokenizer and the 5-gram model with maximum pruning, these are the files you'd need:

```
model.py
files/jomleh-sp-32000.model
files/jomleh-sp-32000.vocab
files/jomleh-sp-32000-o5-prune01111.probing
```
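
If you would rather fetch individual files programmatically instead of cloning, something along these lines should work with the huggingface_hub package (the repository id below is a placeholder; substitute this repository's actual id):

```python
# Fetching only the files needed for one model with huggingface_hub
# (pip install huggingface_hub). REPO_ID is a placeholder.
from huggingface_hub import hf_hub_download

REPO_ID = "<this-repository-id>"

for filename in [
    "model.py",
    "files/jomleh-sp-32000.model",
    "files/jomleh-sp-32000.vocab",
    "files/jomleh-sp-32000-o5-prune01111.probing",
]:
    print(hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir="."))
```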

4. Write your own code to instantiate a model and use it:
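
As a starting point, something along these lines should work. It is only a sketch that loads the KenLM binary and the SentencePiece tokenizer directly with their own packages, assuming the language model was trained on space-joined SentencePiece tokens; see `model.py` for the repository's own loading code:

```python
# Illustrative sketch only: scores a Farsi sentence with the 5-gram model
# from step 3, using the kenlm and sentencepiece packages directly.
import kenlm
import sentencepiece as spm

# Paths follow the file layout shown in step 3.
tokenizer = spm.SentencePieceProcessor(model_file="files/jomleh-sp-32000.model")
lm = kenlm.Model("files/jomleh-sp-32000-o5-prune01111.probing")

sentence = "این یک جمله فارسی است."  # "This is a Farsi sentence."

# Tokenize with SentencePiece and join the pieces with spaces, so KenLM sees
# one token per whitespace-separated word.
tokens = " ".join(tokenizer.encode(sentence, out_type=str))

print(lm.score(tokens, bos=True, eos=True))  # total log10 probability
print(lm.perplexity(tokens))                 # perplexity of the sentence
```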

# What are the different models provided here

There are a total of 36 models in this repository, and while all of them are trained on the Jomleh dataset, which is a Farsi dataset, there are differences among them. Namely:

1. Different vocabulary sizes: For research purposes, I trained models with 6 different vocabulary sizes. Of course, the vocabulary size is a hyperparameter of the tokenizer (SentencePiece here), but each new tokenizer also results in a new language model. The different vocabulary sizes used here are: 2000, 4000, 8000, 16000, 32000, and 57218 tokens. For most use cases, either the 32000 or the 57218 token vocabulary size should be the best option.