Commit e29c86b by SalahZa
1 Parent(s): d451434

more detailed readme

Files changed (1)
  1. README.md +5 -6
README.md CHANGED
@@ -3,7 +3,8 @@
  This project aims to create an Automatic Speech Recognition (ASR) model dedicated for the Tunisian Arabic dialect. The goal is to improve speech recognition technology for underrepresented linguistic communities by transcribing Tunisian dialect speech into written text.
 
  ## Dataset
- All the audio and text data collected to train the model have been provided for free to encourage and support research within the community. Please find the paper [here](https://zenodo.org/record/8342762).
+ Part of the audio and text data (The ones we collected) used to train and test the model has been provided to encourage and support research within the community. Please find the dataset [here](https://zenodo.org/record/8370566). This Zenodo record contains labeled and unlabeled Tunisian Arabic audio data, along with textual data for language modelling.
+ The folder also contains a 4-gram language model trained with KenLM on data released within the Zenodo record. The .arpa file is called "outdomain.arpa".
 
  ## Performance
 
@@ -48,21 +49,19 @@ If you use or refer to this model, please cite :
  This ASR model was trained on :
  * TARIC : The corpus, named TARIC (Tunisian Arabic Railway Interaction Corpus) has a collection of audio recordings and transcriptions from dialogues in the Tunisian Railway Transport Network. - [Taric Corpus](https://aclanthology.org/L14-1385/) -
  * IWSLT : A Tunisian conversational speech - [IWSLT Corpus](https://iwslt.org/2022/dialect)-
- * TunSwitch : Our crowd-collected dataset described in the paper presented below.
+ * TunSwitch : Our crowd-collected dataset described in the paper presented above.
 
  ## Demo
  Here is a working live demo : [LINK](https://huggingface.co/spaces/SalahZa/Code-Switched-Tunisian-SpeechToText)
 
 
-
-
  ## Inference
 
  ### 1. Create a CSV test file
- First, you have to create a csv file that follow SpeechBrain's format which contain 4 columns:
+ First, you have to create a csv file that follows SpeechBrain's format which contain 4 columns:
  * ID: contain ID to identify each audio sample in the dataset
  * wav: contain the path to the audio file
- * wrd: contain the text transcription of the spoken content in the audio file
+ * wrd: contain the text transcription of the spoken content in the audio file if you have it and use your set for evaluation. Put anything if you don't have transcriptions. An example is provided in this folder, the file is called : taric_test.csv
  * duration: the duration of the audio in seconds
 
 
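
The four-column CSV described in step 1 of the README can be produced with Python's standard library. This is a minimal sketch, not the authors' script: the helper names `wav_duration_seconds` and `write_manifest` are hypothetical, the header names are taken from the column list in the README, and it assumes uncompressed PCM WAV files so that the `wave` module can read them.

```python
import csv
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds (hypothetical helper)."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def write_manifest(rows, csv_path):
    """Write (sample_id, wav_path, transcript) tuples to a 4-column CSV.

    Column names follow the README: ID, wav, wrd, duration.
    If no transcription is available, pass any placeholder text as the
    transcript, as the README suggests.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["ID", "wav", "wrd", "duration"])
        for sample_id, wav_path, transcript in rows:
            duration = wav_duration_seconds(wav_path)
            writer.writerow(
                [sample_id, str(Path(wav_path)), transcript, f"{duration:.2f}"]
            )
```

A call such as `write_manifest([("utt1", "audio/utt1.wav", "placeholder")], "my_test.csv")` would then write one header row and one data row; the paths here are illustrative only. The provided `taric_test.csv` remains the reference for the exact layout expected by the inference recipe.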