---
license: mit
datasets:
- mlengineer-ai/jomleh
language:
- fa
metrics:
- perplexity
tags:
- kneser-ney
- n-gram
- kenlm
---
# KenLM models for Farsi
This repository contains KenLM models for the Farsi (Persian) language, trained on the Jomleh
dataset. Among the many use cases for language models such as KenLM, the models provided here are
particularly useful for ASR (automatic speech recognition). They can be used alongside a CTC
decoder to select the more likely sequence of tokens extracted from a spectrogram.
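As a rough sketch of that use case, the snippet below assumes the third-party `pyctcdecode`
library; the label set, logits, and file name are placeholders, and for meaningful language model
scores the decoder's segmentation should match the tokenization the model was trained on:
```
import numpy as np
from pyctcdecode import build_ctcdecoder

# Hypothetical character-level label set of a CTC acoustic model; in practice
# this must match your acoustic model's output vocabulary exactly.
labels = ["", " ", "ا", "ب", "پ", "ت", "ن", "ی"]

# Build a beam-search decoder backed by one of the KenLM binaries from this
# repository (example file name; see the steps below for downloading it).
decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="files/jomleh-sp-32000-o5-prune01111.probing",
)

# `logits` would be your acoustic model's per-frame log-probabilities with
# shape (time_steps, len(labels)); random values are used here only to keep
# the sketch self-contained.
logits = np.log(np.random.dirichlet(np.ones(len(labels)), size=50)).astype(np.float32)
print(decoder.decode(logits))
```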
The models in this repository are KenLM ARPA files converted into binary form. KenLM supports two
binary formats: probing and trie. The models provided here use the probing format, which KenLM
documents as faster but with a bigger memory footprint.
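Loading a probing binary from Python is no different from loading any other KenLM model; here is a
tiny sketch with the `kenlm` package, using one of the file names from this repository:
```
import kenlm

# KenLM detects the binary format from the file contents, so a probing file
# loads just like an ARPA or trie file.
model = kenlm.Model("files/jomleh-sp-32000-o5-prune01111.probing")
print(model.order)  # n-gram order of this model (5 for this file)
```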
There are a total of 36 different KenLM models that you can find here. Unless you are doing
research, you won't need all of them, so I suggest downloading only the files you need rather than
the whole repository, since the total size of all the files is larger than half a terabyte.
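If you only need a few files, one option is the `huggingface_hub` library; the snippet below is
only a sketch, and the `repo_id` is my assumption based on this model card, so adjust it if it does
not match:
```
from huggingface_hub import hf_hub_download

# Download a single model file instead of cloning the whole repository.
# The repo_id is assumed from this model card; adjust it if it differs.
path = hf_hub_download(
    repo_id="mlengineer-ai/kenlm-sp-jomleh",
    filename="files/jomleh-sp-32000-o5-prune01111.probing",
)
print(path)  # local path of the cached download
```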
# Sample code showing how to use the models
Unfortunately, I could not find an easy way to load these models directly through the Hugging Face
library. These are the steps you have to take if you want to use any of the models provided here:
1. Install the KenLM package:
```
pip install https://github.com/kpu/kenlm/archive/master.zip
```
2. Install SentencePiece for tokenization:
```
pip install sentencepiece
```
3. Download the model that you are interested in from this repository along with the Python code
`model.py`. Keep the model in the `files` folder next to `model.py` (just like the file structure
of this repository). Don't forget to download the SentencePiece files as well. For instance, if you
were interested in the 32000-token vocabulary tokenizer and the 5-gram model with maximum pruning,
these are the files you'll need:
```
model.py
files/jomleh-sp-32000.model
files/jomleh-sp-32000.vocab
files/jomleh-sp-32000-o5-prune01111.probing
```
4. Write your own code to instantiate a model and use it (a minimal sketch is shown below).
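Since the exact interface of `model.py` is not documented here, the following is only a sketch that
scores a sentence directly with the `kenlm` and `sentencepiece` packages; the file names are the
examples from step 3, and it assumes the n-gram models were trained on space-joined SentencePiece
pieces:
```
import kenlm
import sentencepiece as spm

# Load the SentencePiece tokenizer and the KenLM probing binary
# (example file names from step 3 above).
sp = spm.SentencePieceProcessor(model_file="files/jomleh-sp-32000.model")
lm = kenlm.Model("files/jomleh-sp-32000-o5-prune01111.probing")

sentence = "این یک جمله نمونه است."

# Tokenize into SentencePiece pieces and join them with spaces, assuming the
# n-gram model was trained on space-separated SentencePiece tokens.
tokens = " ".join(sp.encode_as_pieces(sentence))

# Total log10 probability of the sequence, including <s> and </s>.
print(lm.score(tokens, bos=True, eos=True))

# Perplexity of the tokenized sentence (lower is better).
print(lm.perplexity(tokens))
```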
# What are the different models provided here?
There are a total of 36 models in this repository, and while all of them are trained on the Jomleh
dataset, which is a Farsi dataset, there are differences among them. Namely:
1. Different vocabulary sizes: For research purposes, I trained tokenizers with 6 different
vocabulary sizes. Of course, the vocabulary size is a hyperparameter of the tokenizer (SentencePiece
here), but each new tokenizer results in a new language model. The vocabulary sizes used here are
2000, 4000, 8000, 16000, 32000, and 57218 tokens. For most use cases, either the 32000- or
57218-token vocabulary should be the best option.