
EsperBERTo Model Card

Model Description

EsperBERTo is a RoBERTa-like model trained from scratch on Esperanto, using text from the OSCAR corpus and the Leipzig Corpora Collection. It has roughly 83.5M parameters (stored as F32 safetensors) and is designed for masked language modeling and other text-based prediction tasks, serving as a base model for Esperanto text understanding and generation.

Datasets

  • OSCAR Corpus (Esperanto): Extracted from Common Crawl dumps, filtered by language classification.
  • Leipzig Corpora Collection (Esperanto): Includes texts from news, literature, and Wikipedia.
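
For reference, the Esperanto portion of OSCAR can be pulled through the Hugging Face datasets library. This is a minimal sketch; the configuration name is an assumption and may vary across OSCAR releases:

from datasets import load_dataset

# Load the deduplicated Esperanto split of OSCAR (config name assumed;
# it may differ between OSCAR releases).
oscar_eo = load_dataset("oscar", "unshuffled_deduplicated_eo", split="train")

print(oscar_eo[0]["text"][:200])  # peek at the first document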

Preprocessing

  • Trained a byte-level Byte-Pair Encoding (BPE) tokenizer with a vocabulary size of 52,000 tokens (a training sketch follows below).
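
A minimal sketch of training such a tokenizer with the Hugging Face tokenizers library; the corpus paths and the RoBERTa-style special-token list are assumptions, not details taken from this model's training run:

from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

# Hypothetical location of the plain-text Esperanto corpus files
paths = [str(p) for p in Path("./data/esperanto").glob("*.txt")]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=paths,
    vocab_size=52_000,  # matches the vocabulary size stated above
    min_frequency=2,    # assumed minimum frequency for merges
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt to the output directory
tokenizer.save_model("./EsperBERTo")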

Hyperparameters

  • Number of Epochs: 1
  • Batch Size per GPU: 64
  • Checkpoint Save Interval: every 10,000 steps
  • Maximum Saved Checkpoints: 2
  • Loss Calculation: Prediction loss only
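
Put together, these settings correspond roughly to the Trainer configuration below. This is a reconstruction under assumptions: the file paths are placeholders, and the model dimensions (6 layers, 12 attention heads) are guesses chosen to be consistent with the reported 83.5M-parameter size rather than documented values:

from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Assumed architecture, sized to land near the reported 83.5M parameters
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)

# Tokenizer trained in the preprocessing step above
tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", max_len=512)

# Hypothetical path to the combined training corpus
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./data/esperanto/train.txt",
    block_size=128,
)

# Standard masked-language-modeling collator (15% masking)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    num_train_epochs=1,              # Number of Epochs: 1
    per_device_train_batch_size=64,  # Batch Size per GPU: 64
    save_steps=10_000,               # Checkpoint Save Interval
    save_total_limit=2,              # Maximum Saved Checkpoints
    prediction_loss_only=True,       # Loss Calculation: prediction loss only
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()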

Software and Libraries

  • Library: Hugging Face Transformers (version not recorded)
  • Training Script: run_language_modeling.py

How to Use

You can query the model directly with the fill-mask pipeline:

from transformers import pipeline

# Load the fill-mask pipeline with the EsperBERTo model and its tokenizer
fill_mask = pipeline(
    "fill-mask",
    model="SamJoshua/EsperBERTo",
    tokenizer="SamJoshua/EsperBERTo",
)

# "Jen la komenco de bela <mask>." means "Here is the beginning of a beautiful <mask>."
fill_mask("Jen la komenco de bela <mask>.")

Evaluation Results

The model has not yet been evaluated on a standardized test set. Future updates will include evaluation metrics such as perplexity and accuracy on a held-out validation set.
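
Once such a validation set exists, perplexity can be derived from the evaluation loss. Below is a minimal sketch, in which the validation file path and evaluation batch size are assumptions:

import math

from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("SamJoshua/EsperBERTo")
model = AutoModelForMaskedLM.from_pretrained("SamJoshua/EsperBERTo")

# Hypothetical held-out Esperanto validation file
eval_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./data/esperanto/valid.txt",
    block_size=128,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./eval", per_device_eval_batch_size=64),
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True),
    eval_dataset=eval_dataset,
)

# Perplexity of a masked LM is exp(cross-entropy loss) over masked tokens
eval_loss = trainer.evaluate()["eval_loss"]
print(f"Perplexity: {math.exp(eval_loss):.2f}")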

Intended Uses & Limitations

Intended Uses: This model is aimed at researchers, developers, and language enthusiasts who wish to explore Esperanto language processing, whether through masked-token prediction out of the box or by fine-tuning for tasks like text generation, sentiment analysis, and more.

Limitations:

  • The model was trained for only one epoch due to computational constraints, which may limit its handling of more complex language structures.
  • As the model is trained on public web text, it may inadvertently learn and replicate social biases present in the training data.

Feel free to contribute to the model by fine-tuning on specific tasks or extending its training with more data or epochs. This model serves as a baseline for further research and development in Esperanto language modeling.
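
As one possible starting point for such fine-tuning, the snippet below loads the checkpoint with a sequence-classification head; the two-label setup is a hypothetical task for illustration, not a shipped configuration:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("SamJoshua/EsperBERTo")

# Attach a freshly initialized classification head on top of the
# pretrained encoder; num_labels=2 assumes a binary task.
model = AutoModelForSequenceClassification.from_pretrained(
    "SamJoshua/EsperBERTo", num_labels=2
)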
