metadata
license: unknown
datasets:
  - raicrits/YouTube_RAI_dataset
language:
  - it
pipeline_tag: text-classification
tags:
  - LLM
  - Italian
  - Classification
  - BERT
  - Topics
library_name: transformers

Model Card raicrits/BERT_ChangeOfTopic

bert-base-multilingual-cased fine-tuned to detect a change of topic in a given text.

Model Description

The model is fine-tuned for the specific task of detecting a change of topic in a given text. Given a text, the model outputs "1" if it detects a change of topic and "0" otherwise. It was trained on the chapters of the YouTube videos contained in the train split of the dataset raicrits/YouTube_RAI_dataset.

  • Developed by: Stefano Scotta (stefano.scotta@rai.it)
  • Model type: LLM fine-tuned for the specific task of detecting a change of topic in a given text
  • Language(s) (NLP): Italian
  • License: unknown
  • Fine-tuned from model: bert-base-multilingual-cased

Uses

The model can be used to check whether a change of topic occurs in a given text.
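One possible way to apply such a check to a long transcript is to slide a fixed-size window over the text and classify each window. The sketch below is only an illustration under stated assumptions: the window size, the stride, and the helper names `sliding_windows` and `find_topic_changes` are hypothetical and not part of the model card, and `dummy_classify` is a stand-in for the real model, which would be called in its place.

```python
from typing import Callable, List, Tuple


def sliding_windows(words: List[str], size: int, stride: int) -> List[Tuple[int, str]]:
    """Return (start_index, window_text) pairs that cover the word list."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append((start, " ".join(words[start:start + size])))
        if start + size >= len(words):
            break  # the last window reached the end of the text
        start += stride
    return windows


def find_topic_changes(text: str,
                       classify: Callable[[str], int],
                       size: int = 100,
                       stride: int = 50) -> List[int]:
    """Return the start index (in words) of every window that classify labels 1."""
    words = text.split()
    return [start
            for start, window in sliding_windows(words, size, stride)
            if classify(window) == 1]


# Stand-in classifier for demonstration only, NOT the real model:
# it flags a window when two hard-coded topic words co-occur.
def dummy_classify(window: str) -> int:
    return 1 if "politica" in window and "sport" in window else 0


transcript = " ".join(["politica"] * 4 + ["sport"] * 4)
print(find_topic_changes(transcript, dummy_classify, size=4, stride=2))
```

With the real model, `classify` would wrap the tokenization and forward pass and return the predicted label. Window size and stride would need to be tuned, since the model card does not state how long the training texts were.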

How to Get Started with the Model

Use the code below to get started with the model.


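import numpy as np
import torch
from transformers import AutoTokenizer

# Run on the GPU when one is available, otherwise on the CPU.
device_bert = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# torch.load expects the path to a local file, not a Hub repository ID:
# replace the string below with the local path of the model file
# downloaded from raicrits/BERT_ChangeOfTopic.
# weights_only=False lets recent PyTorch versions unpickle a full model
# object; use it only with files you trust.
model_bert = torch.load('raicrits/BERT_ChangeOfTopic',
                        map_location=device_bert,
                        weights_only=False)
model_bert = model_bert.to(device_bert)
model_bert.eval()  # disable dropout for inference

tokenizer_bert = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

# Tokenize the input ('<text>' is a placeholder), padded or truncated to 256 tokens.
encoded_dict = tokenizer_bert(
                   '<text>',
                   add_special_tokens = True,
                   max_length = 256,
                   truncation = True,
                   padding = 'max_length',
                   return_attention_mask = True,
                   return_tensors = 'pt',
              )
input_ids = encoded_dict['input_ids'].to(device_bert)
input_mask = encoded_dict['attention_mask'].to(device_bert)

with torch.no_grad():
    output = model_bert(input_ids,
                        token_type_ids=None,
                        attention_mask=input_mask)

# Pick the class with the highest logit: 1 = change of topic, 0 = no change.
logits = output.logits.detach().cpu().numpy()
pred_flat = np.argmax(logits, axis=1).flatten()
print(pred_flat[0])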
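The last step of the snippet above, turning logits into a 0/1 label, can also report how confident the prediction is. The following is a minimal sketch in plain Python, assuming the model returns two logits per input, with index 0 meaning no change of topic and index 1 meaning a change of topic. The names `LABELS`, `softmax` and `logits_to_predictions` are illustrative and not part of the model's API.

```python
import math
from typing import List, Sequence, Tuple

# Assumed label mapping, following the "1" / "0" convention of the model card.
LABELS = {0: "no change of topic", 1: "change of topic"}


def softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    highest = max(row)
    exps = [math.exp(value - highest) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def logits_to_predictions(logits: Sequence[Sequence[float]]) -> List[Tuple[int, str, float]]:
    """Map each row of logits to (label_id, label_name, probability)."""
    results = []
    for row in logits:
        probs = softmax(row)
        label_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        results.append((label_id, LABELS[label_id], probs[label_id]))
    return results


# Example with made-up logits for a single input.
print(logits_to_predictions([[-1.0, 2.0]]))
```

With the model loaded as above, the input could be obtained with `output.logits.cpu().tolist()`.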

Training Details

Training Data

Chapters in the YouTube videos contained in the train split of the dataset raicrits/YouTube_RAI_dataset.

Training Procedure

Training settings:

  • Training epochs: 18
  • Learning rate: 2e-05

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: 1 NVIDIA A100 (40 GB)
  • Hours used: 20
  • Cloud Provider: Private Infrastructure
  • Carbon Emitted: 2.38 kg CO2 eq.
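
Estimates of this kind are typically computed as energy consumed multiplied by the carbon intensity of the electricity used. The sketch below shows that arithmetic on the figures reported above. The hours and total emissions come from this model card, while the average power draw of 0.25 kW is an assumption made only for illustration, so the implied carbon intensity is not a reported value.

```python
def estimate_emissions_kg(power_kw: float, hours: float,
                          carbon_intensity_kg_per_kwh: float) -> float:
    """Emissions (kg CO2 eq.) = power (kW) x time (h) x carbon intensity (kg/kWh)."""
    return power_kw * hours * carbon_intensity_kg_per_kwh


# Figures reported in the model card.
HOURS_USED = 20.0
REPORTED_EMISSIONS_KG = 2.38

# Emissions per hour of training implied by the reported figures.
implied_rate_kg_per_hour = REPORTED_EMISSIONS_KG / HOURS_USED

# ASSUMPTION (not from the model card): average power draw of the GPU.
ASSUMED_POWER_KW = 0.25

# Carbon intensity that would reproduce the reported total under that assumption.
implied_intensity = implied_rate_kg_per_hour / ASSUMED_POWER_KW

print(f"Implied rate: {implied_rate_kg_per_hour:.3f} kg CO2 eq. per hour")
print(f"Implied intensity at {ASSUMED_POWER_KW} kW: {implied_intensity:.3f} kg CO2 eq. per kWh")
```

A different assumed power draw changes the implied intensity proportionally, while the per-hour rate depends only on the reported figures.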

Model Card Authors

Stefano Scotta (stefano.scotta@rai.it)

Model Card Contact

stefano.scotta@rai.it