From Babble to Words
The models, tokenizers, and datasets used for our BabyLM 2024 submission. We provide eight prediction files (predictions.json.gz), one per model; the best-performing is BPE-TXT.
phonemetransformers/BABYLM-TOKENIZER-CHAR-PHON
Tokenizer trained on the BabyLM dataset, using character-based tokenization for phonemic text. Word boundaries are kept.
phonemetransformers/BABYLM-TOKENIZER-BPE-PHON
Tokenizer trained on the BabyLM dataset, using BPE tokenization for phonemic text. Word boundaries are kept.
phonemetransformers/BABYLM-TOKENIZER-CHAR-TXT
Tokenizer trained on the BabyLM dataset, using character-based tokenization for orthographic text. Word boundaries are kept.
phonemetransformers/BABYLM-TOKENIZER-BPE-TXT
Tokenizer trained on the BabyLM dataset, using BPE tokenization for orthographic text. Word boundaries are kept.
phonemetransformers/BABYLM-TOKENIZER-BPE-PHON-SPACELESS
Tokenizer trained on the BabyLM dataset, using BPE tokenization for phonemic text. Word boundaries are removed.
phonemetransformers/BABYLM-TOKENIZER-CHAR-TXT-SPACELESS
Tokenizer trained on the BabyLM dataset, using character-based tokenization for orthographic text. Word boundaries are removed.
phonemetransformers/BABYLM-TOKENIZER-BPE-TXT-SPACELESS
Tokenizer trained on the BabyLM dataset, using BPE tokenization for orthographic text. Word boundaries are removed.
phonemetransformers/BABYLM-TOKENIZER-CHAR-PHON-SPACELESS
Tokenizer trained on the BabyLM dataset, using character-based tokenization for phonemic text. Word boundaries are removed.
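The eight tokenizers cover three independent choices: segmentation (character-based vs. BPE), input representation (phonemic vs. orthographic), and whether word boundaries are kept or removed (the SPACELESS variants). A minimal pure-Python sketch of the segmentation and boundary-removal steps follows; this is illustrative only, not the released training code, and the actual tokenizers live on the Hugging Face Hub under the ids above.

```python
from collections import Counter

def char_tokenize(text: str) -> list[str]:
    """CHAR-* variants: every character, including spaces, is a token."""
    return list(text)

def remove_boundaries(text: str) -> str:
    """*-SPACELESS variants: strip word boundaries before tokenizing."""
    return text.replace(" ", "")

def bpe_tokenize(text: str, num_merges: int) -> list[str]:
    """BPE-* variants (sketch): start from characters, then repeatedly
    merge the most frequent adjacent pair into a new symbol."""
    tokens = list(text)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(char_tokenize("the cat"))                     # ['t', 'h', 'e', ' ', 'c', 'a', 't']
print(char_tokenize(remove_boundaries("the cat")))  # ['t', 'h', 'e', 'c', 'a', 't']
```

The same two preprocessing axes apply to phonemic and orthographic input alike; only the underlying alphabet (phonemes vs. letters) differs.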
phonemetransformers/GPT2-85M-BPE-PHON
GPT-2 with 85M non-embedding parameters, trained using the BPE-PHON tokenizer.
phonemetransformers/GPT2-85M-BPE-PHON-SPACELESS
GPT-2 with 85M non-embedding parameters, trained using the BPE-PHON-SPACELESS tokenizer.
phonemetransformers/GPT2-85M-CHAR-TXT-SPACELESS
GPT-2 with 85M non-embedding parameters, trained using the CHAR-TXT-SPACELESS tokenizer.
phonemetransformers/GPT2-85M-CHAR-PHON
GPT-2 with 85M non-embedding parameters, trained using the CHAR-PHON tokenizer.
phonemetransformers/GPT2-85M-CHAR-PHON-SPACELESS
GPT-2 with 85M non-embedding parameters, trained using the CHAR-PHON-SPACELESS tokenizer.
phonemetransformers/GPT2-85M-CHAR-TXT
GPT-2 with 85M non-embedding parameters, trained using the CHAR-TXT tokenizer.
phonemetransformers/GPT2-85M-BPE-TXT-SPACELESS
GPT-2 with 85M non-embedding parameters, trained using the BPE-TXT-SPACELESS tokenizer.
phonemetransformers/GPT2-85M-BPE-TXT
GPT-2 with 85M non-embedding parameters, trained using the BPE-TXT tokenizer.
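Each repo name advertises 85M non-embedding parameters. As a back-of-envelope check, assuming standard GPT-2 small dimensions (12 layers, 768-dim residual stream; the list above does not state the exact shape), the transformer blocks alone account for roughly that count:

```python
def gpt2_nonembedding_params(n_layer: int, d_model: int) -> int:
    """Rough non-embedding parameter count for a GPT-2-style model.

    Per block: attention (Q, K, V and output projections) = 4 * d_model^2;
    MLP (two projections with a 4x hidden width) = 8 * d_model^2.
    Biases and LayerNorm weights are ignored in this estimate.
    """
    return 12 * n_layer * d_model * d_model

# Assumed GPT-2-small shape: 12 layers, 768-dim residual stream.
print(gpt2_nonembedding_params(12, 768))  # 84934656, i.e. ~85M
```

Counting non-embedding parameters (rather than total parameters) keeps the model size comparable across tokenizers, since character-level and BPE vocabularies differ greatly in size and hence in embedding-table cost.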