|
--- |
|
license: unknown |
|
datasets: |
|
- raicrits/YouTube_RAI_dataset |
|
language: |
|
- it |
|
pipeline_tag: text-classification |
|
tags: |
|
- LLM |
|
- Italian |
|
- Classification |
|
- BERT |
|
- Topics |
|
library_name: transformers |
|
--- |
|
|
|
|
|
|
# Model Card for raicrits/BERT_ChangeOfTopic
|
|
|
|
|
|
[bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) fine-tuned to detect a change of topic in a given text.
|
|
|
|
|
### Model Description |
|
|
|
|
|
The model is fine-tuned for the specific task of detecting a change of topic in a given text: given a text, it answers "1" if it detects a change of topic and "0" otherwise.

The training was done using the chapters of the YouTube videos contained in the train split of the dataset [raicrits/YouTube_RAI_dataset](https://huggingface.co/datasets/raicrits/YouTube_RAI_dataset).
|
|
|
|
|
- **Developed by:** Stefano Scotta (stefano.scotta@rai.it) |
|
- **Model type:** LLM fine-tuned on the specific task of detecting a change of topic in a given text
|
- **Language(s) (NLP):** Italian |
|
- **License:** unknown |
|
- **Finetuned from model:** [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)
|
|
|
|
|
## Uses |
|
|
|
The model can be used to check whether a change of topic occurs in a given text.
|
|
|
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model.
|
```python
import numpy as np
import torch
from transformers import AutoTokenizer

# run on GPU if available
device_bert = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# load the fine-tuned classifier and move it to the chosen device
model_bert = torch.load('raicrits/BERT_ChangeOfTopic')
model_bert = model_bert.to(device_bert)
model_bert.eval()

# the tokenizer is the one of the base model
tokenizer_bert = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

encoded_dict = tokenizer_bert.encode_plus(
    '<text>',
    add_special_tokens=True,
    max_length=256,
    truncation=True,
    padding='max_length',
    return_attention_mask=True,
    return_tensors='pt',
)
input_ids = encoded_dict['input_ids'].to(device_bert)
input_mask = encoded_dict['attention_mask'].to(device_bert)

with torch.no_grad():
    output = model_bert(input_ids,
                        token_type_ids=None,
                        attention_mask=input_mask)

logits = output.logits.detach().cpu().numpy()
pred_flat = np.argmax(logits, axis=1).flatten()
print(pred_flat[0])  # "1": change of topic detected, "0": no change of topic
```
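For longer inputs such as full transcripts, one possible pattern (an illustrative sketch, not part of the original card) is to slide a window of sentences over the text and run the classifier on each window. The helper names below (`detect_topic_change`, `segment_text`) are hypothetical and assume that `model_bert`, `tokenizer_bert` and `device_bert` have been loaded as in the snippet above; the window size of 6 sentences is an arbitrary choice.

```python
def detect_topic_change(text, model, tokenizer, device, max_length=256):
    """Return 1 if the model detects a change of topic in `text`, else 0."""
    enc = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=max_length,
        truncation=True,
        padding='max_length',
        return_attention_mask=True,
        return_tensors='pt',
    )
    with torch.no_grad():
        out = model(enc['input_ids'].to(device),
                    token_type_ids=None,
                    attention_mask=enc['attention_mask'].to(device))
    return int(out.logits.argmax(dim=1).item())


def segment_text(sentences, model, tokenizer, device, window=6):
    """Scan a list of sentences with a sliding window and return the start
    indices of the windows in which a change of topic is detected."""
    boundaries = []
    for i in range(len(sentences) - window + 1):
        chunk = ' '.join(sentences[i:i + window])
        if detect_topic_change(chunk, model, tokenizer, device) == 1:
            boundaries.append(i)
    return boundaries
```

A call such as `segment_text(sentences, model_bert, tokenizer_bert, device_bert)` then returns the window start indices at which the model flags a change of topic.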
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
Chapters of the YouTube videos contained in the train split of the dataset [raicrits/YouTube_RAI_dataset](https://huggingface.co/datasets/raicrits/YouTube_RAI_dataset).
|
|
|
### Training Procedure |
|
|
|
|
|
**Training settings:**

- train epochs: 18

- learning rate: 2e-05
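The original training script is not included in this card. The sketch below shows one plausible way to reproduce a comparable setup with the Hugging Face `Trainer`, using the epochs and learning rate listed above; the dataset column names (`text`, `label`), the batch size, and the assumption that the train split can be used as-is are not taken from the card, which states that the examples were built from video chapters and may require preprocessing.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

# assumption: the dataset exposes a 'text' column and a binary 'label' column
dataset = load_dataset('raicrits/YouTube_RAI_dataset')
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True,
                     padding='max_length', max_length=256)

tokenized = dataset.map(tokenize, batched=True)

# binary head: "1" = change of topic, "0" = no change
model = BertForSequenceClassification.from_pretrained(
    'bert-base-multilingual-cased', num_labels=2)

args = TrainingArguments(
    output_dir='bert_change_of_topic',
    num_train_epochs=18,             # value reported in the card
    learning_rate=2e-5,              # value reported in the card
    per_device_train_batch_size=16,  # assumption: batch size not reported
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized['train'])
trainer.train()
```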
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
|
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
|
|
|
- **Hardware Type:** 1x NVIDIA A100 40 GB
|
- **Hours used:** 20 |
|
- **Cloud Provider:** Private Infrastructure |
|
- **Carbon Emitted:** 2.38 kg CO2 eq.
|
|
|
## Model Card Authors |
|
|
|
Stefano Scotta (stefano.scotta@rai.it) |
|
|
|
## Model Card Contact |
|
|
|
stefano.scotta@rai.it |