BERTimbau-large-text-filter

BERTimbau-large-text-filter is a BERT model that can be used to score the quality of a given Portuguese text string. This model was trained on the GigaVerbo-Text-Filter dataset.

Details

  • Size: 334,398,466 parameters
  • Dataset: GigaVerbo-Text-Filter
  • Language: Portuguese
  • Number of Training Epochs: 3
  • Batch size: 128
  • Optimizer: torch.optim.AdamW
  • Learning Rate: 4e-5

This repository has the source code used to train this model.

Usage

Here's an example of how to use the BERTimbau-large-text-filter:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TextClassificationPipeline
import torch

# Use a GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and the classification model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("TucanoBR/BERTimbau-large-text-filter")
model = AutoModelForSequenceClassification.from_pretrained("TucanoBR/BERTimbau-large-text-filter")
model.to(device)

# Wrap the model and tokenizer in a pipeline and score one Portuguese passage.
classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=device)
result = classifier("Os tucanos são aves que correspondem à família Ramphastidae, vivem nas florestas tropicais da América Central e América do Sul. A família inclui cinco gêneros e mais de quarenta espécies diferentes. Possuem bicos notavelmente grandes e coloridos, que possuem a função de termorregulação para as muitas espécies que passam muito tempo na copa da floresta exposta ao sol tropical quente.")

# The pipeline returns a list with one dict per input, holding "label" and "score".
print(result)
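
A text-classification pipeline returns one dict per input with a `label` and a `score`, so the classifier output can be used to filter a batch of texts. The sketch below is only an illustration: the helper `keep_high_quality`, the label name `"LABEL_1"`, and the `0.5` threshold are assumptions, not part of this model card. Check `model.config.id2label` for the actual label names before relying on it. The predictions shown are stand-in values, not real model scores.

```python
from typing import Dict, List


def keep_high_quality(
    texts: List[str],
    predictions: List[Dict[str, object]],
    positive_label: str = "LABEL_1",  # assumption: verify with model.config.id2label
    threshold: float = 0.5,  # hypothetical cutoff, tune for your data
) -> List[str]:
    """Return texts whose prediction is `positive_label` with score >= threshold."""
    if len(texts) != len(predictions):
        raise ValueError("texts and predictions must have the same length")
    kept = []
    for text, pred in zip(texts, predictions):
        if pred["label"] == positive_label and float(pred["score"]) >= threshold:
            kept.append(text)
    return kept


# Stand-in predictions shaped like the pipeline output (not real model scores).
example_texts = ["Texto bem escrito.", "asdf qwer zxcv"]
example_predictions = [
    {"label": "LABEL_1", "score": 0.98},
    {"label": "LABEL_0", "score": 0.91},
]
print(keep_high_quality(example_texts, example_predictions))
```

With the real model, the predictions would come from calling the pipeline on a list of strings, for example `classifier(example_texts)`.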

Cite as 🤗

@misc{correa2024tucanoadvancingneuraltext,
      title={{Tucano: Advancing Neural Text Generation for Portuguese}}, 
      author={Corr{\^e}a, Nicholas Kluge and Sen, Aniket and Falk, Sophia and Fatimah, Shiza},
      year={2024},
      eprint={2411.07854},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.07854}, 
}

Acknowledgments

We gratefully acknowledge access to the Marvin cluster, hosted by the University of Bonn, and the support provided by its High Performance Computing & Analytics Lab.

License

BERTimbau-large-text-filter is licensed under the Apache License, Version 2.0. For more details, see the LICENSE file.
