This organization hosts the models, tokenizers and datasets used for our BabyLM 2024 submission, which investigates the viability of training LLMs on phoneme streams.
Language Modelling with Phonemes
AI & ML interests
Child language acquisition, CHILDES, word segmentation, phonemes, BabyLM
Collections: 1
Spaces: 1
Models: 104
phonemetransformers/childes-segmentation-800k-gpt2_lm-model (Text Generation)
phonemetransformers/childes-multilingual-5M-gpt2_lm-model (Text Generation)
phonemetransformers/CHILDES-phoneme-tokenizer
phonemetransformers/CHILDES-Cantonese-phoneme-tokenizer
phonemetransformers/CHILDES-Mandarin-phoneme-tokenizer
phonemetransformers/debug2-gpt2_lm-model (Text Generation)
phonemetransformers/debug-gpt2_lm-model (Text Generation)
phonemetransformers/childes-size-english-gpt2_lm-model
phonemetransformers/CHILDES-Polish-phoneme-tokenizer
phonemetransformers/CHILDES-Serbian-phoneme-tokenizer