---
license: mit
datasets:
- mlengineer-ai/jomleh
language:
- fa
metrics:
- perplexity
tags:
- kneser-ney
- n-gram
- kenlm
---
# KenLM models for Farsi
This repository contains KenLM models for the Farsi (Persian) language, trained on the Jomleh
dataset. Among the many use cases for language models such as KenLM, the models provided here are
particularly useful for ASR (automatic speech recognition). They can be used alongside a CTC
decoder to select the more likely sequence of tokens extracted from a spectrogram.
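As a rough sketch of that use case, the snippet below assumes the third-party `pyctcdecode`
library; the label set, logits, and file name are placeholders, and for meaningful language model
scores the decoder's segmentation should match the tokenization the model was trained on:
```
import numpy as np
from pyctcdecode import build_ctcdecoder

# Hypothetical character-level label set of a CTC acoustic model; in practice
# this must match your acoustic model's output vocabulary exactly.
labels = ["", " ", "ا", "ب", "پ", "ت", "ن", "ی"]

# Build a beam-search decoder backed by one of the KenLM binaries from this
# repository (example file name; see the steps below for downloading it).
decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="files/jomleh-sp-32000-o5-prune01111.probing",
)

# `logits` would be your acoustic model's per-frame log-probabilities with
# shape (time_steps, len(labels)); random values are used here only to keep
# the sketch self-contained.
logits = np.log(np.random.dirichlet(np.ones(len(labels)), size=50)).astype(np.float32)
print(decoder.decode(logits))
```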
The models in this repository are KenLM ARPA files converted into binary form. KenLM supports two
binary formats: probing and trie. The models provided here use the probing format, which KenLM
documents as faster but with a bigger memory footprint.
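Loading a probing binary from Python is no different from loading any other KenLM model; here is a
tiny sketch with the `kenlm` package, using one of the file names from this repository:
```
import kenlm

# KenLM detects the binary format from the file contents, so a probing file
# loads just like an ARPA or trie file.
model = kenlm.Model("files/jomleh-sp-32000-o5-prune01111.probing")
print(model.order)  # n-gram order of this model (5 for this file)
```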
There are a total of 36 different KenLM models that you can find here. Unless you are doing
research, you won't need all of them, so I suggest downloading only the files you need rather than
the whole repository, since the total size of all the files is larger than half a terabyte.
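If you only need a few files, one option is the `huggingface_hub` library; the snippet below is
only a sketch, and the `repo_id` is my assumption based on this model card, so adjust it if it does
not match:
```
from huggingface_hub import hf_hub_download

# Download a single model file instead of cloning the whole repository.
# The repo_id is assumed from this model card; adjust it if it differs.
path = hf_hub_download(
    repo_id="mlengineer-ai/kenlm-sp-jomleh",
    filename="files/jomleh-sp-32000-o5-prune01111.probing",
)
print(path)  # local path of the cached download
```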
# Sample code showing how to use the models
Unfortunately, I could not find an easy way to load these models directly through the Hugging Face
library. These are the steps you have to take if you want to use any of the models provided here:
1. Install the KenLM package:
```
pip install https://github.com/kpu/kenlm/archive/master.zip
```
2. Install SentencePiece for tokenization:
```
pip install sentencepiece
```
3. Download the model that you are interested in from this repository along with the Python code
`model.py`. Keep the model in the `files` folder next to `model.py` (just like the file structure
of this repository). Don't forget to download the SentencePiece files as well. For instance, if you
were interested in the 32000-token vocabulary tokenizer and the 5-gram model with maximum pruning,
these are the files you'll need:
```
model.py
files/jomleh-sp-32000.model
files/jomleh-sp-32000.vocab
files/jomleh-sp-32000-o5-prune01111.probing
```
4. Write your own code to instantiate a model and use it (a minimal sketch is shown below).
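Since the exact interface of `model.py` is not documented here, the following is only a sketch that
scores a sentence directly with the `kenlm` and `sentencepiece` packages; the file names are the
examples from step 3, and it assumes the n-gram models were trained on space-joined SentencePiece
pieces:
```
import kenlm
import sentencepiece as spm

# Load the SentencePiece tokenizer and the KenLM probing binary
# (example file names from step 3 above).
sp = spm.SentencePieceProcessor(model_file="files/jomleh-sp-32000.model")
lm = kenlm.Model("files/jomleh-sp-32000-o5-prune01111.probing")

sentence = "این یک جمله نمونه است."

# Tokenize into SentencePiece pieces and join them with spaces, assuming the
# n-gram model was trained on space-separated SentencePiece tokens.
tokens = " ".join(sp.encode_as_pieces(sentence))

# Total log10 probability of the sequence, including <s> and </s>.
print(lm.score(tokens, bos=True, eos=True))

# Perplexity of the tokenized sentence (lower is better).
print(lm.perplexity(tokens))
```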
# What are the different models provided here?
There are a total of 36 models in this repository, and while all of them are trained on the Jomleh
dataset, which is a Farsi dataset, there are differences among them. Namely:
1. Different vocabulary sizes: For research purposes, I trained tokenizers with 6 different
vocabulary sizes. Of course, the vocabulary size is a hyperparameter of the tokenizer (SentencePiece
here), but each new tokenizer results in a new language model. The vocabulary sizes used here are
2000, 4000, 8000, 16000, 32000, and 57218 tokens. For most use cases, either the 32000- or
57218-token vocabulary should be the best option.