Edit model card

bert-base-cased-finetuned

This model is a fine-tuned version of bert-base-cased on the usa_news_en dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0900
  • Accuracy: 0.9800

Model description

The fine-tuned model corresponds to a binary classification model that determines whether two English news headlines are related or not related. In the following paper {News Gathering: Leveraging Transformers to Rank News} it can find more details. To utilize the fine-tuned model, you can follow the steps outlined below:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer

### 1. Load the model:
model_name = "cmunhozc/news-ranking-ft-bert"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

### 2. Dataset:
def preprocess_fctn(examples):
  return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)
...
encoded_dataset = dataset.map(preprocess_fctn, batched=True, load_from_cache_file=False)
...

### 3. Evaluation:
def compute_metrics(eval_pred):
  predictions, labels = eval_pred
  predictions = np.argmax(predictions, axis=1)

trainer_hf = Trainer(model,
                     eval_dataset    = encoded_dataset['validation'],
                     tokenizer       = tokenizer,
                     compute_metrics = compute_metrics)

trainer_hf.evaluate()

predictions = trainer_hf.predict(encoded_dataset["validation"])
acc_val  = metric.compute(predictions=np.argmax(predictions.predictions,axis=1).tolist(), references=predictions.label_ids)['accuracy']

Finally, with the classification above model, you can follow the steps below to generate the news ranking.

  • For each news article in the google_news_en dataset dataset positioned as the first element in a pair, retrieve all corresponding pairs from the dataset.
  • Employing pair encoders, rank the news articles that occupy the second position in each pair, determining their relevance to the first article.
  • Organize each list generated by the encoders based on the probabilities obtained for the relevance class.

Intended uses & limitations

More information needed

Training, evaluation and test data

The training data is sourced from the train split in usa_news_en dataset, and a similar procedure is applied for the validation set. In the case of testing, the initial segment for the text classification model is derived from the test_1 and test_2 splits. As for the ranking model, the test dataset from google_news_en dataset is utilized

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 32
  • eval_batch_size: 32
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3

Training results

Training Loss Epoch Step Validation Loss Accuracy
0.0967 1.0 3526 0.0651 0.9771
0.0439 2.0 7052 0.0820 0.9776
0.0231 3.0 10578 0.0900 0.9800

Framework versions

  • Transformers 4.35.2
  • Pytorch 2.1.0+cu121
  • Datasets 2.16.1
  • Tokenizers 0.15.0
Downloads last month
8
Safetensors
Model size
108M params
Tensor type
F32
·
Inference Examples
Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Finetuned from

Dataset used to train cmunhozc/news-ranking-ft-bert