---
license: unknown
datasets:
- raicrits/YouTube_RAI_dataset
language:
- it
pipeline_tag: text-classification
tags:
- LLM
- Italian
- Classification
- BERT
- Topics
library_name: transformers
---
---
# Model Card raicrits/BERT_ChangeOfTopic
<!-- Provide a quick summary of what the model is/does. -->
[bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) fine-tuned to detect a change of topic in a given text.
### Model Description
The model is fine-tuned for the specific task of detecting a change of topic in a given text. Given a text, the model outputs "1" if it detects a change of topic and "0" otherwise.
The model was trained on the chapters of the YouTube videos contained in the train split of the dataset [raicrits/YouTube_RAI_dataset](https://huggingface.co/datasets/raicrits/YouTube_RAI_dataset).
- **Developed by:** Stefano Scotta (stefano.scotta@rai.it)
- **Model type:** LLM fine-tuned for the specific task of detecting a change of topic in a given text
- **Language(s) (NLP):** Italian
- **License:** unknown
- **Finetuned from model:** [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)
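As a minimal sketch of the 0/1 output convention described above, the snippet below picks the class with the largest logit. The logit values and the `decode_prediction` helper are hypothetical and serve only as an illustration; they are not part of the released code.

```python
from typing import Sequence

# Label meanings as stated in the model description.
LABELS = {0: "no change of topic", 1: "change of topic"}


def decode_prediction(logits: Sequence[float]) -> int:
    """Return the index of the largest logit (argmax over the two classes)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical logits for one input text, in the order [class 0, class 1].
example_logits = [-1.3, 2.1]
label = decode_prediction(example_logits)
print(label, LABELS[label])
```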
## Uses
The model can be used to check whether a change of topic occurs in a given text.
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
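One possible way to apply such a check to a longer transcript is to slide a fixed-size window over its sentences and record the windows flagged with "1". The sketch below assumes this windowing strategy, which the model card does not prescribe; `find_topic_changes`, the window size, and the keyword-based `dummy_classify` stand-in are all hypothetical. In practice `classify` would wrap a call to the model.

```python
from typing import Callable, List


def find_topic_changes(
    sentences: List[str],
    classify: Callable[[str], int],
    window: int = 3,
) -> List[int]:
    """Return the start indices of the windows that `classify` labels as 1."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if not sentences:
        return []
    flagged = []
    # Slide the window one sentence at a time; a short input yields one window.
    for start in range(max(len(sentences) - window + 1, 1)):
        chunk = " ".join(sentences[start:start + window])
        if classify(chunk) == 1:
            flagged.append(start)
    return flagged


def dummy_classify(text: str) -> int:
    # Placeholder for the real model: flags text mentioning both topics.
    return int("calcio" in text and "meteo" in text)


sentences = [
    "Parliamo di calcio.",
    "La partita finisce tardi.",
    "Ora passiamo al meteo.",
    "Domani piove al nord.",
]
print(find_topic_changes(sentences, dummy_classify))
```

Overlapping windows mean that the same boundary may be flagged more than once; how to merge such detections is left open here.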
## How to Get Started with the Model
Use the code below to get started with the model.
**Usage:**
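``` python
import numpy as np
import torch
from transformers import AutoTokenizer

# Run on GPU when available, otherwise fall back to CPU.
device_bert = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# torch.load expects a path to a local checkpoint file: download the model
# file from this repository first and adjust the path below if needed.
# Recent PyTorch versions may also require weights_only=False to load a
# fully pickled model; only do this for files you trust.
model_bert = torch.load('raicrits/BERT_ChangeOfTopic', map_location=device_bert)
model_bert = model_bert.to(device_bert)
model_bert.eval()  # disable dropout for inference

tokenizer_bert = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

# Tokenize the input, padding or truncating it to 256 tokens.
encoded_dict = tokenizer_bert.encode_plus(
    '<text>',  # replace with the text to classify
    add_special_tokens=True,
    max_length=256,
    truncation=True,
    padding='max_length',
    return_attention_mask=True,
    return_tensors='pt',
)
input_ids = encoded_dict['input_ids'].to(device_bert)
input_mask = encoded_dict['attention_mask'].to(device_bert)

# Forward pass without gradient tracking.
with torch.no_grad():
    output = model_bert(input_ids,
                        token_type_ids=None,
                        attention_mask=input_mask)

# Pick the class with the highest logit: 1 = change of topic, 0 = no change.
logits = output.logits.detach().cpu().numpy()
pred_flat = np.argmax(logits, axis=1).flatten()
print(pred_flat[0])
```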
## Training Details
### Training Data
Chapters of the YouTube videos contained in the train split of the dataset [raicrits/YouTube_RAI_dataset](https://huggingface.co/datasets/raicrits/YouTube_RAI_dataset).
### Training Procedure
**Training setting:**
- train epochs=18
- learning_rate=2e-05
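To give a sense of what these settings imply, the sketch below stores them in a small configuration object and derives the number of optimizer steps. Only the epoch count and the learning rate come from this card; `FinetuneConfig`, the batch size, and the number of training examples are hypothetical placeholders.

```python
import math
from dataclasses import dataclass


@dataclass
class FinetuneConfig:
    num_train_epochs: int = 18    # from the model card
    learning_rate: float = 2e-5   # from the model card
    batch_size: int = 32          # hypothetical, not stated in the card
    num_examples: int = 10_000    # hypothetical, not stated in the card

    def total_steps(self) -> int:
        """Optimizer steps over the full run, one step per batch."""
        steps_per_epoch = math.ceil(self.num_examples / self.batch_size)
        return steps_per_epoch * self.num_train_epochs


config = FinetuneConfig()
print(config.total_steps())
```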
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** 1 NVIDIA A100 (40 GB)
- **Hours used:** 20
- **Cloud Provider:** Private Infrastructure
- **Carbon Emitted:** 2.38 kg CO2 eq.
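Estimates of this kind typically follow the relation emissions = power × time × carbon intensity of the electricity. As a rough sketch, the snippet below back-computes the carbon intensity implied by the figures above. The 250 W power draw is an assumption, since the card does not report power consumption, so the result is illustrative only.

```python
def energy_kwh(power_watts: float, hours: float) -> float:
    """Energy consumed in kWh for a constant power draw."""
    return power_watts / 1000.0 * hours


def implied_carbon_intensity(emitted_kg: float, power_watts: float, hours: float) -> float:
    """Carbon intensity (kg CO2 eq. per kWh) implied by the reported emissions."""
    return emitted_kg / energy_kwh(power_watts, hours)


ASSUMED_POWER_WATTS = 250.0  # assumption: not reported in the card
HOURS_USED = 20.0            # from the card
CARBON_EMITTED_KG = 2.38     # from the card

energy = energy_kwh(ASSUMED_POWER_WATTS, HOURS_USED)
intensity = implied_carbon_intensity(CARBON_EMITTED_KG, ASSUMED_POWER_WATTS, HOURS_USED)
print(f"{energy:.1f} kWh, {intensity:.3f} kg CO2 eq./kWh")
```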
## Model Card Authors
Stefano Scotta (stefano.scotta@rai.it)
## Model Card Contact
stefano.scotta@rai.it