Spaces: ivrit-ai/hebrew-transcription-leaderboard

Commit 500dab3 (1 parent: 73db7c2), committed by yanirmr 15 days ago

update source description

src/about.py +14 -13

@@ -44,28 +44,29 @@ The following datasets are used in our evaluation:
 ### [ivrit-ai/eval-d1](https://huggingface.co/datasets/ivrit-ai/eval-d1)
 - **Size**: 2 hours
-- **Domain**: Manual transcription of podcasts. Typical segment length is 5 minutes.
-- **Source**: Description of source
 ### [ivrit-ai/saspeech](https://huggingface.co/datasets/ivrit-ai/saspeech)
-- **Size**: X hours
-- **Domain**: Description
-- **Source**: Description of source
 ### [google/fleurs/he](https://huggingface.co/datasets/google/fleurs)
 - **Size**: X hours
-- **Domain**: Description
-- **Source**: Description of source
 ### [mozilla-foundation/common_voice_17_0/he](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
-- **Size**: X hours
-- **Domain**: Description
-- **Source**: Description of source
 ### [imvladikon/hebrew_speech_kan](https://huggingface.co/datasets/imvladikon/hebrew_speech_kan)
-- **Size**: X hours
-- **Domain**: Description
-- **Source**: Description of source
 """
 # Technical details about evaluation

 ### [ivrit-ai/eval-d1](https://huggingface.co/datasets/ivrit-ai/eval-d1)
 - **Size**: 2 hours
+- **Domain**: Manual transcription of a single podcast episode featuring an informal conversation between two speakers (male and female). Audio is segmented into approximately 5-minute chunks.
+- **Source**: Part of the ivrit.ai corpus. The selected episode was manually transcribed to gold-standard quality to serve as a high-quality evaluation benchmark.
 ### [ivrit-ai/saspeech](https://huggingface.co/datasets/ivrit-ai/saspeech)
+- **Size**: 4 hours (manually corrected portion of the corpus)
+- **Domain**: Economic and political podcast content, containing both read speech and conversational segments. Segments are several seconds in length.
+- **Source**: Derived from the [Robo-Shaul project](https://www.roboshaul.com/) and published in the paper
+"SASPEECH: A Hebrew Single Speaker Dataset for Text To Speech and Voice Conversion" (Sharoni, O., Shenberg, R., Cooper, E., Proc. INTERSPEECH 2023).
 ### [google/fleurs/he](https://huggingface.co/datasets/google/fleurs)
 - **Size**: X hours
+- **Domain**: Read speech covering common topics and phrases in Hebrew
+- **Source**: Created as part of Google's FLEURS project, designed for multilingual speech tasks and evaluation. Data collected through crowdsourcing from Hebrew speakers.
 ### [mozilla-foundation/common_voice_17_0/he](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
+- **Size**: X hours (test set of the corpus)
+- **Domain**: Read sentences in Hebrew from various texts.
+- **Source**: Collected through Mozilla's Common Voice initiative, where volunteers contribute recordings and validate other speakers' contributions.
 ### [imvladikon/hebrew_speech_kan](https://huggingface.co/datasets/imvladikon/hebrew_speech_kan)
+- **Size**: 1.7 hours (validation set of the corpus)
+- **Domain**: Varied content types from the Kan (Israeli Public Broadcasting Corporation) YouTube channel
+- **Source**: Published by Vladimir Gurevich. Audio and subtitle data scraped from the YouTube channel "כאן" (Kan).
 """
 # Technical details about evaluation
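
The dataset entries in this diff all follow the same layout: a linked heading followed by Size, Domain, and Source bullets. As a minimal sketch of how that layout could be generated from structured data, the Python below renders one entry. It is an illustration only and not how `src/about.py` is written; the names `DatasetCard` and `render_card` are hypothetical, and the field values are copied from the entry for `imvladikon/hebrew_speech_kan` above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetCard:
    """Hypothetical container for one evaluation dataset entry."""
    name: str
    url: str
    size: str
    domain: str
    source: str


def render_card(card: DatasetCard) -> str:
    """Render one entry as a linked heading plus three bullets."""
    return "\n".join([
        f"### [{card.name}]({card.url})",
        f"- **Size**: {card.size}",
        f"- **Domain**: {card.domain}",
        f"- **Source**: {card.source}",
    ])


# Values taken from the diff; the structure itself is an assumption.
kan = DatasetCard(
    name="imvladikon/hebrew_speech_kan",
    url="https://huggingface.co/datasets/imvladikon/hebrew_speech_kan",
    size="1.7 hours (validation set of the corpus)",
    domain="Varied content types from the Kan (Israeli Public Broadcasting Corporation) YouTube channel",
    source="Published by Vladimir Gurevich.",
)

print(render_card(kan))
```

Keeping the fields in one place like this would make leftover placeholders such as `X hours` easier to spot, since each entry must supply every field explicitly.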