TUBELEX FastText Word Embeddings

fastText word embeddings trained on the TUBELEX YouTube subtitle corpora. We use the 300-dimensional fastText CBOW model with position weights, 10 negative samples, 10 epochs, and character 5-grams; all other parameters are left at their defaults (Grave et al., 2018).

We provide both '*.bin' files (for fastText) and '*.vec' files, which follow the common word2vec text format and can be used, for instance, with the gensim package.
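For readers who want to see what the word2vec text format looks like, here is a minimal, dependency-free sketch of a reader. It assumes the usual layout: a header line with the vocabulary size and the vector dimension, followed by one line per word with space-separated components. The function name `read_vec` and the toy data are hypothetical and used only for illustration; in practice gensim's `KeyedVectors.load_word2vec_format(path, binary=False)` is the more convenient route.

```python
import io


def read_vec(stream, limit=None):
    """Read word2vec text format from a text stream.

    Returns (dim, vectors), where vectors maps each word to a list of floats.
    `limit` optionally caps the number of words read (illustrative helper,
    not part of any library).
    """
    header = stream.readline().split()
    dim = int(header[1])  # header is "<vocab_size> <dim>"
    vectors = {}
    for line in stream:
        if limit is not None and len(vectors) >= limit:
            break
        if not line.strip():
            continue  # skip blank lines
        parts = line.rstrip().split(" ")
        # The last `dim` fields are the vector; everything before is the word.
        word = " ".join(parts[:-dim])
        vectors[word] = [float(x) for x in parts[-dim:]]
    return dim, vectors


# Toy data standing in for a real '*.vec' file (values are made up).
TOY_VEC = "2 3\nkoala 0.1 0.2 0.3 \nwombat 0.4 0.5 0.6 \n"
dim, vectors = read_vec(io.StringIO(TOY_VEC))
print(dim, sorted(vectors))
```

For a real file, open it with `open(path, encoding='utf-8')` and pass the file object instead of the `io.StringIO` stand-in. Reading a full 300-dimensional vocabulary into Python lists is slow and memory-hungry, which is one reason to prefer gensim for anything beyond inspection.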

What is TUBELEX?

TUBELEX is a YouTube subtitle corpus currently available for Chinese, English, Indonesian, Japanese, and Spanish.

@article{nohejl_etal_2024_film,
  title={Beyond {{Film Subtitles}}: {{Is YouTube}} the {{Best Approximation}} of {{Spoken Vocabulary}}?},
  author={Nohejl, Adam and Hudi, Frederikus and Kardinata, Eunike Andriani and Ozaki, Shintaro and Riera Machin, Maria Angelica and Sun, Hongyu and Vasselli, Justin and Watanabe, Taro},
  year={2024}, eprint={2410.03240}, archiveprefix={arXiv}, primaryclass={cs.CL},
  url={https://arxiv.org/abs/2410.03240v1}, journal={ArXiv preprint}, volume={arXiv:2410.03240v1 [cs]}
}

Usage

To download and use the fastText models in Python, first install dependencies:

pip install huggingface_hub
pip install fasttext

You can then use e.g. the English (en) model in the following way:

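import fasttext
from huggingface_hub import hf_hub_download

# Download the English model file from the Hugging Face Hub (cached locally after the first call).
model_file = hf_hub_download(repo_id='naist-nlp/tubelex-kenlm', filename='tubelex-en.bin')
model = fasttext.load_model(model_file)

# Look up the 300-dimensional vector for a word.
print(model['koala'])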
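A common next step is to compare two word vectors by cosine similarity. The sketch below is one possible way to do that in plain Python; `cosine_similarity` is a hypothetical helper, and the toy vectors are made-up stand-ins so the example runs without downloading a model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, used here instead of real embeddings.
toy = {
    "koala": [1.0, 0.0, 1.0],
    "wombat": [1.0, 0.0, 0.5],
    "car": [0.0, 1.0, 0.0],
}

sim_animals = cosine_similarity(toy["koala"], toy["wombat"])
sim_mixed = cosine_similarity(toy["koala"], toy["car"])
print(sim_animals > sim_mixed)
```

With the model loaded as above, the same helper should accept the vectors returned by lookups such as `model['koala']`, for example `cosine_similarity(model['koala'], model['wombat'])`.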