metadata
license: apache-2.0
language:
  - pt
metrics:
  - accuracy
  - f1
pipeline_tag: text-classification
tags:
  - news
  - health
  - classification
model-index:
  - name: raphaelfontes/HealthNewsBRT
    results:
      - task:
          type: text-classification
          name: Text Classification
        metrics:
          - type: accuracy
            value: 0.95
            name: Accuracy
            verified: false
          - type: f1
            value: 0.95
            name: F1 Score
            verified: false

HealthNewsBRT - BERT Classification Model for Brazilian Portuguese News Articles

Introduction

This repository contains a BERT-based model that classifies news articles written in Brazilian Portuguese (pt-BR) into two categories: Health News (LABEL_0) and Non-Health News (LABEL_1).

Pretrained Model (BERTimbau)

For this project, we used BERTimbau, a BERT model pretrained on Brazilian Portuguese text, as the base model and fine-tuned it for this classification task.

Classification report

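| Class        | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| LABEL_0      | 0.96      | 0.95   | 0.95     | 14000   |
| LABEL_1      | 0.95      | 0.96   | 0.96     | 14000   |
| Accuracy     |           |        | 0.95     | 28000   |
| Macro Avg    | 0.96      | 0.95   | 0.95     | 28000   |
| Weighted Avg | 0.96      | 0.95   | 0.95     | 28000   |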
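
For readers unfamiliar with the columns of the report, the following minimal sketch shows how precision, recall, and F1 are conventionally derived from per-class confusion counts. The counts used here are made-up illustrative values, not this model's confusion matrix, and `prf` is a hypothetical helper name.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, f1) for one class from its confusion counts."""
    # Precision: share of predicted positives that are correct
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for one class (NOT taken from this model's evaluation)
precision, recall, f1 = prf(tp=90, fp=10, fn=30)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The macro average in the report is the unweighted mean of the per-class values, while the weighted average weights each class by its support; with two equally sized classes the two coincide.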

Dataset

For training and evaluation, we used a dataset consisting of 28,000 labeled news articles in Portuguese. The dataset is divided as follows:

  • 14,000 samples of Health News (LABEL_0): These articles are related to various health topics, such as medical discoveries, healthcare policies, and wellness.
  • 14,000 samples of Non-Health News (LABEL_1): These articles cover a wide range of subjects that do not fall under the health category, including politics, sports, entertainment, and more.

The dataset was collected and preprocessed to ensure consistency and quality in labeling and text formatting.

Data Splitting

To assess the model's performance, we split the dataset into training and testing subsets. We used an 80-20 split, with 80% of the data used for training and 20% for testing. This split helps us evaluate how well the model generalizes to new, unseen data.
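
As an illustration of an 80-20 split, here is a minimal sketch using only the Python standard library. It assumes a stratified split with a fixed random seed; this README does not state whether the original split was stratified or seeded, and `stratified_split` and the placeholder articles are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, test_fraction=0.2, seed=42):
    """Split (text, label) pairs so each label keeps the same train/test ratio."""
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])

    # Shuffle again so the classes are interleaved within each subset
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Placeholder data with the class balance described above (0 = health, 1 = non-health)
samples = [(f"article {i}", i % 2) for i in range(28_000)]
train, test = stratified_split(samples)
print(len(train), len(test))
```

Stratifying keeps the 50/50 class balance in both subsets, so test metrics are not skewed by an uneven draw.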

Usage

from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the pretrained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('raphaelfontes/HealthNewsBRT')
model = BertForSequenceClassification.from_pretrained('raphaelfontes/HealthNewsBRT')

# Define a news article (placeholder; replace with Brazilian Portuguese text)
news_article = "This is a news article in Portuguese about a health-related topic."

# Tokenize and encode the news article
inputs = tokenizer(news_article, return_tensors='pt', padding=True, truncation=True)

# Make predictions
with torch.no_grad():
    outputs = model(**inputs)

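# Get predicted label (index of the highest logit)
predicted_label = torch.argmax(outputs.logits, dim=-1).item()

# Map label to human-readable category
# (per this README: LABEL_0 = Health News, LABEL_1 = Non-Health News)
if predicted_label == 0:
    category = "Health News"
else:
    category = "Non-Health News"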

print(f"The article is categorized as: {category}")
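
If a confidence score is wanted alongside the category, the logits can be turned into probabilities with a softmax. The sketch below uses only the standard library so it runs without downloading the model. `ID2NAME`, `softmax`, and `describe` are hypothetical names, the example logits are made up, and the label names assume the LABEL_0 / LABEL_1 mapping described in this README.

```python
import math

# Label names as described in this README (assumed to match the checkpoint's label ids)
ID2NAME = {0: "Health News", 1: "Non-Health News"}


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe(logits):
    """Return (category name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return ID2NAME[idx], probs[idx]


# Hypothetical logits for one article; in practice pass outputs.logits[0].tolist()
name, prob = describe([2.0, -1.0])
print(f"{name} ({prob:.1%})")
```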