yanirmr committed on
Commit
500dab3
1 Parent(s): 73db7c2

update source description

Files changed (1)
  1. src/about.py +14 -13
src/about.py CHANGED
@@ -44,28 +44,29 @@ The following datasets are used in our evaluation:

  ### [ivrit-ai/eval-d1](https://huggingface.co/datasets/ivrit-ai/eval-d1)
  - **Size**: 2 hours
- - **Domain**: Manual transcription of podcasts. Typical segment length is 5 minutes.
- - **Source**: Description of source
+ - **Domain**: Manual transcription of a single podcast episode featuring an informal conversation between two speakers (male and female). Audio is segmented into approximately 5-minute chunks.
+ - **Source**: Part of the ivrit.ai corpus. The selected episode was manually transcribed to gold-standard quality to serve as a high-quality evaluation benchmark.

  ### [ivrit-ai/saspeech](https://huggingface.co/datasets/ivrit-ai/saspeech)
- - **Size**: X hours
- - **Domain**: Description
- - **Source**: Description of source
+ - **Size**: 4 hours (manually corrected portion of the corpus)
+ - **Domain**: Economic and political podcast content, containing both read speech and conversational segments. Segments are several seconds in length.
+ - **Source**: Derived from the [Robo-Shaul project](https://www.roboshaul.com/) and published in the paper
+ "SASPEECH: A Hebrew Single Speaker Dataset for Text To Speech and Voice Conversion" (Sharoni, O., Shenberg, R., Cooper, E., Proc. INTERSPEECH 2023).

  ### [google/fleurs/he](https://huggingface.co/datasets/google/fleurs)
  - **Size**: X hours
- - **Domain**: Description
- - **Source**: Description of source
+ - **Domain**: Read speech covering common topics and phrases in Hebrew.
+ - **Source**: Created as part of Google's FLEURS project, designed for multilingual speech tasks and evaluation. Data collected through crowdsourcing from Hebrew speakers.

  ### [mozilla-foundation/common_voice_17_0/he](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
- - **Size**: X hours
- - **Domain**: Description
- - **Source**: Description of source
+ - **Size**: X hours (test set of the corpus)
+ - **Domain**: Read sentences in Hebrew from various texts.
+ - **Source**: Collected through Mozilla's Common Voice initiative, where volunteers contribute recordings and validate other speakers' contributions.

  ### [imvladikon/hebrew_speech_kan](https://huggingface.co/datasets/imvladikon/hebrew_speech_kan)
- - **Size**: X hours
- - **Domain**: Description
- - **Source**: Description of source
+ - **Size**: 1.7 hours (validation set of the corpus)
+ - **Domain**: Varied content types from the Kan (Israeli Public Broadcasting Corporation) YouTube channel.
+ - **Source**: Published by Vladimir Gurevich. Audio and subtitle data scraped from the YouTube channel "כאן" (Kan).
  """

  # Technical details about evaluation