Update README.md
Browse files
README.md
ADDED
@@ -0,0 +1,105 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020
|
2 |
+
|
3 |
+
Models and predictions for submission to TRAC - 2020 Second Workshop on Trolling, Aggression and Cyberbullying
|
4 |
+
|
5 |
+
Our approach is described in our paper titled:
|
6 |
+
|
7 |
+
> Mishra, Sudhanshu, Shivangi Prasad, and Shubhanshu Mishra. 2020. "Multilingual Joint Fine-Tuning of Transformer Models for Identifying Trolling, Aggression and Cyberbullying at TRAC 2020." In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020).
|
8 |
+
|
9 |
+
The source code for training this model and more details can be found on our code repository: https://github.com/socialmediaie/TRAC2020
|
10 |
+
|
11 |
+
NOTE: These models are retrained for uploading here after our submission so the evaluation measures may be slightly different from the ones reported in the paper.
|
12 |
+
|
13 |
+
If you plan to use the dataset please cite the following resources:
|
14 |
+
|
15 |
+
* Mishra, Sudhanshu, Shivangi Prasad, and Shubhanshu Mishra. 2020. "Multilingual Joint Fine-Tuning of Transformer Models for Identifying Trolling, Aggression and Cyberbullying at TRAC 2020." In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020).
|
16 |
+
* Mishra, Shubhanshu, Shivangi Prasad, and Shubhanshu Mishra. 2020. “Trained Models for Multilingual Joint Fine-Tuning of Transformer Models for Identifying Trolling, Aggression and Cyberbullying at TRAC 2020.” University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-8882752_V1.
|
17 |
+
|
18 |
+
|
19 |
+
```
|
20 |
+
@inproceedings{Mishra2020TRAC,
|
21 |
+
author = {Mishra, Sudhanshu and Prasad, Shivangi and Mishra, Shubhanshu},
|
22 |
+
booktitle = {Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020)},
|
23 |
+
title = {{Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020}},
|
24 |
+
year = {2020}
|
25 |
+
}
|
26 |
+
|
27 |
+
@data{illinoisdatabankIDB-8882752,
|
28 |
+
author = {Mishra, Shubhanshu and Prasad, Shivangi and Mishra, Shubhanshu},
|
29 |
+
doi = {10.13012/B2IDB-8882752_V1},
|
30 |
+
publisher = {University of Illinois at Urbana-Champaign},
|
31 |
+
title = {{Trained models for Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020}},
|
32 |
+
url = {https://doi.org/10.13012/B2IDB-8882752{\_}V1},
|
33 |
+
year = {2020}
|
34 |
+
}
|
35 |
+
```
|
36 |
+
|
37 |
+
|
38 |
+
## Usage
|
39 |
+
|
40 |
+
The models can be used via the following code:
|
41 |
+
|
42 |
+
```python
|
43 |
+
from transformers import AutoModel, AutoTokenizer, AutoModelForSequenceClassification
import torch
from pathlib import Path
from scipy.special import softmax
import numpy as np
import pandas as pd

# Label inventory for each TRAC-2020 sub-task; index order matches the
# model's output logits.
TASK_LABEL_IDS = {
    "Sub-task A": ["OAG", "NAG", "CAG"],
    "Sub-task B": ["GEN", "NGEN"],
    "Sub-task C": ["OAG-GEN", "OAG-NGEN", "NAG-GEN", "NAG-NGEN", "CAG-GEN", "CAG-NGEN"]
}

model_version = "databank"  # other option is the hugging face library ("huggingface")

if model_version == "databank":
    # Make sure you have downloaded the required model file from
    # https://databank.illinois.edu/datasets/IDB-8882752
    # Unzip the file at some model_path (we are using: "databank_model")
    # Assuming you get the following type of structure inside "databank_model":
    #   'databank_model/ALL/Sub-task C/output/bert-base-multilingual-uncased/model'
    # i.e. two directory levels (<lang>/<task>) before "output", so the glob
    # needs two wildcards and the unpacking skips the root directory.
    model_path = next(Path("databank_model").glob("./*/*/output/*/model"))
    _, lang, task, _, base_model, _ = model_path.parts
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
else:
    lang, task, base_model = "ALL", "Sub-task C", "bert-base-multilingual-uncased"
    # Hub repositories are keyed by the sub-task letter (e.g. "C"), taken from
    # the last whitespace-separated token of `task` ("Sub-task C" -> "C").
    # NOTE(review): confirm the exact repo id against the socialmediaie hub listing.
    base_model = f"socialmediaie/TRAC2020_{lang}_{task.split()[-1]}_{base_model}"
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(base_model)

# For doing inference set model in eval mode
model.eval()

task_labels = TASK_LABEL_IDS[task]

sentence = "This is a good cat and this is a bad dog."
# Prepend the CLS token explicitly: tokenizer.tokenize() does not add special
# tokens on its own, and we tokenize the processed sentence (not the raw one).
processed_sentence = f"{tokenizer.cls_token} {sentence}"
tokens = tokenizer.tokenize(processed_sentence)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
tokens_tensor = torch.tensor([indexed_tokens])

# No gradients needed for inference.
with torch.no_grad():
    logits, = model(tokens_tensor, labels=None)

preds = logits.detach().cpu().numpy()
preds_probs = softmax(preds, axis=1)  # logits -> per-class probabilities
preds = np.argmax(preds_probs, axis=1)
preds_labels = np.array(task_labels)[preds]
print(dict(zip(task_labels, preds_probs[0])), preds_labels)
"""You should get an output as follows:

({'CAG-GEN': 0.06762535,
 'CAG-NGEN': 0.03244293,
 'NAG-GEN': 0.6897794,
 'NAG-NGEN': 0.15498641,
 'OAG-GEN': 0.034373745,
 'OAG-NGEN': 0.020792078},
 array(['NAG-GEN'], dtype='<U8'))

"""
|
104 |
+
|
105 |
+
```
|