MahmoudAshraf/mms-300m-1130-forced-aligner

	---
	language:
	- ab
	- af
	- ak
	- am
	- ar
	- as
	- av
	- ay
	- az
	- ba
	- bm
	- be
	- bn
	- bi
	- bo
	- sh
	- br
	- bg
	- ca
	- cs
	- ce
	- cv
	- ku
	- cy
	- da
	- de
	- dv
	- dz
	- el
	- en
	- eo
	- et
	- eu
	- ee
	- fo
	- fa
	- fj
	- fi
	- fr
	- fy
	- ff
	- ga
	- gl
	- gn
	- gu
	- zh
	- ht
	- ha
	- he
	- hi
	- hu
	- hy
	- ig
	- ia
	- ms
	- is
	- it
	- jv
	- ja
	- kn
	- ka
	- kk
	- kr
	- km
	- ki
	- rw
	- ky
	- ko
	- kv
	- lo
	- la
	- lv
	- ln
	- lt
	- lb
	- lg
	- mh
	- ml
	- mr
	- mk
	- mg
	- mt
	- mn
	- mi
	- my
	- nl
	- 'no'
	- ne
	- ny
	- oc
	- om
	- or
	- os
	- pa
	- pl
	- pt
	- ps
	- qu
	- ro
	- rn
	- ru
	- sg
	- sk
	- sl
	- sm
	- sn
	- sd
	- so
	- es
	- sq
	- su
	- sv
	- sw
	- ta
	- tt
	- te
	- tg
	- tl
	- th
	- ti
	- ts
	- tr
	- uk
	- vi
	- wo
	- xh
	- yo
	- zu
	- za
	license: cc-by-nc-4.0
	tags:
	- mms
	- wav2vec2
	---

	# Forced Alignment with Hugging Face CTC Models
	This Python package provides an efficient way to perform forced alignment between text and audio using Hugging Face's pretrained models. It also features an improved implementation that uses much less memory than the TorchAudio forced alignment API.

	The model checkpoint uploaded here is the MMS-300M checkpoint trained on a forced alignment dataset, converted from torchaudio to Hugging Face Transformers format.
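
To make the term concrete, here is a minimal sketch of what forced alignment with a CTC model computes: given per-frame log-probabilities (emissions) and a known token sequence, it finds the most likely assignment of frames to tokens. This is a plain-Python Viterbi illustration, not the package's implementation; the function name `forced_align` and the toy emission values are assumptions made for the example.

```python
import math


def forced_align(log_probs, tokens, blank=0):
    """Viterbi-align `tokens` to the frames of `log_probs` under CTC rules.

    log_probs: list of frames, each a list of per-symbol log-probabilities.
    tokens: non-empty list of target symbol ids (without blanks).
    Returns one (start_frame, end_frame) pair per token, end inclusive.
    """
    # Interleave blanks with the targets: [blank, t0, blank, t1, ..., blank].
    states = [blank]
    for tok in tokens:
        states += [tok, blank]
    n_frames, n_states = len(log_probs), len(states)

    score = [[-math.inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]

    # A path may start in the leading blank or in the first token.
    score[0][0] = log_probs[0][states[0]]
    score[0][1] = log_probs[0][states[1]]

    for t in range(1, n_frames):
        for s in range(n_states):
            # Allowed predecessors: stay, advance by one state, or skip the
            # blank between two different tokens.
            candidates = [s]
            if s >= 1:
                candidates.append(s - 1)
            if s >= 2 and states[s] != blank and states[s] != states[s - 2]:
                candidates.append(s - 2)
            best = max(candidates, key=lambda p: score[t - 1][p])
            score[t][s] = score[t - 1][best] + log_probs[t][states[s]]
            back[t][s] = best

    # A valid path ends in the last token or in the trailing blank.
    s = max((n_states - 1, n_states - 2), key=lambda p: score[-1][p])
    if score[-1][s] == -math.inf:
        raise ValueError("not enough frames to align all tokens")

    # Follow the back-pointers to recover the state visited at each frame.
    path = [s]
    for t in range(n_frames - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    path.reverse()

    # Token states sit at odd indices; record the frame range of each one.
    spans = [None] * len(tokens)
    for t, s in enumerate(path):
        if s % 2 == 1:
            i = s // 2
            start = spans[i][0] if spans[i] else t
            spans[i] = (start, t)
    return spans


# Toy emissions over the vocabulary {0: blank, 1: "a", 2: "b"}.
hi, lo = math.log(0.8), math.log(0.1)
toy_emissions = [
    [lo, hi, lo],  # frame 0: "a"
    [lo, hi, lo],  # frame 1: "a"
    [hi, lo, lo],  # frame 2: blank
    [lo, lo, hi],  # frame 3: "b"
    [lo, lo, hi],  # frame 4: "b"
]
print(forced_align(toy_emissions, [1, 2]))  # -> [(0, 1), (3, 4)]
```

The package's `get_alignments` and `get_spans` functions, shown under Usage, handle this step on real model emissions.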

	## Installation

	```bash
	pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git
	```

	## Usage

	```python
	import torch
	from ctc_forced_aligner import (
	    load_audio,
	    load_alignment_model,
	    generate_emissions,
	    preprocess_text,
	    get_alignments,
	    get_spans,
	    postprocess_results,
	)

	audio_path = "your/audio/path"
	text_path = "your/text/path"
	language = "iso"  # ISO 639-3 language code, e.g. "eng" for English
	device = "cuda" if torch.cuda.is_available() else "cpu"
	batch_size = 16

	# Load the alignment model; use half precision only on GPU.
	alignment_model, alignment_tokenizer, alignment_dictionary = load_alignment_model(
	    device,
	    dtype=torch.float16 if device == "cuda" else torch.float32,
	)

	# Load the audio using the model's dtype and device.
	audio_waveform = load_audio(audio_path, alignment_model.dtype, alignment_model.device)

	# Read the transcript and collapse it into a single line
	# (assumes the file is UTF-8 encoded).
	with open(text_path, "r", encoding="utf-8") as f:
	    text = f.read().replace("\n", " ").strip()

	# Compute the model's emissions for the audio; `stride` is needed later
	# by postprocess_results.
	emissions, stride = generate_emissions(
	    alignment_model, audio_waveform, batch_size=batch_size
	)

	# Prepare the transcript (romanized here) for alignment.
	tokens_starred, text_starred = preprocess_text(
	    text,
	    romanize=True,
	    language=language,
	)

	# Align the tokens to the emissions.
	segments, scores, blank_id = get_alignments(
	    emissions,
	    tokens_starred,
	    alignment_dictionary,
	)

	# Turn the aligned segments into spans, one per token.
	spans = get_spans(tokens_starred, segments, alignment_tokenizer.decode(blank_id))

	# Combine text, spans, and scores into word-level timestamps.
	word_timestamps = postprocess_results(text_starred, spans, stride, scores)
	```
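
Once `word_timestamps` is available, a common next step is to export it. The sketch below writes the entries in SubRip (SRT) form. It assumes each entry is a dict with `start` and `end` times in seconds and a `text` field; that shape is an assumption, so inspect the actual output of `postprocess_results` before relying on it. The helper names `format_timestamp` and `to_srt` are hypothetical and not part of the package.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(entries):
    """Render entries (assumed keys: "start", "end", "text") as SRT text."""
    blocks = []
    for index, entry in enumerate(entries, start=1):
        start = format_timestamp(entry["start"])
        end = format_timestamp(entry["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{entry['text']}\n")
    return "\n".join(blocks)


# Hypothetical sample standing in for `word_timestamps`.
sample = [
    {"start": 0.0, "end": 0.5, "text": "hello"},
    {"start": 0.62, "end": 1.25, "text": "world"},
]
print(to_srt(sample))
```

With real output, `to_srt(word_timestamps)` could be written to a `.srt` file, provided the entries match the assumed shape.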