erikbranmarino committed on
Commit b47e2cf · verified · 1 Parent(s): df13e93

Upload folder using huggingface_hub

README.md ADDED
---
language: multilingual
tags:
- conspiracy-detection
- content-moderation
- bert
- prct
- social-media
license: mit
datasets:
- custom
metrics:
- accuracy
- f1
- precision
- recall
---

# CT-BERT-PRCT

## Model description

CT-BERT-PRCT is a fine-tuned version of CT-BERT (COVID-Twitter-BERT v2) adapted for detecting Population Replacement Conspiracy Theory (PRCT) content across social media platforms. The model was trained to identify both explicit and implicit PRCT narratives while maintaining robust cross-platform generalization.

## Intended uses & limitations

### Intended uses

- Content moderation for social media platforms
- Research on conspiracy theory propagation
- Cross-platform conspiracy content detection
- Multilingual PRCT detection

### Limitations

- Performance may vary across social media platforms
- May require periodic fine-tuning to adapt to evolving narratives
- Should be used as part of a broader content moderation strategy
- Performs best on YouTube content, with some degradation on other platforms

## Training and evaluation data

The model was fine-tuned on a dataset of 56,085 YouTube comments and evaluated on:
- A manually annotated gold standard of 500 YouTube comments
- A cross-platform test set of 160 Telegram messages in Spanish and Portuguese
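
For reference, a minimal sketch of how the reported metrics could be scored against such a gold standard with scikit-learn (the labels below are hypothetical placeholders, and the averaging method is an assumption, as the model card does not state it):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# 0 = Non-PRCT, 1 = PRCT; hypothetical gold labels and model predictions
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"  # averaging choice is an assumption
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```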

## Training procedure

The model was fine-tuned with the following configuration (sketched in code below):
- Learning rate: 2e-5
- Batch size: 32
- Maximum epochs: 6
- Early stopping based on validation performance
- Base model: CT-BERT (`digitalepidemiologylab/covid-twitter-bert-v2`, pre-trained on COVID-19-related Twitter content)
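
A minimal sketch of a comparable setup with the Hugging Face `Trainer` (not the exact training script; `train_dataset` and `eval_dataset` are hypothetical tokenized datasets with a `labels` column, and the early-stopping patience is an assumption):

```python
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "digitalepidemiologylab/covid-twitter-bert-v2", num_labels=2
)

args = TrainingArguments(
    output_dir="ct-bert-prct",
    learning_rate=2e-5,              # as listed above
    per_device_train_batch_size=32,  # as listed above
    num_train_epochs=6,              # maximum epochs
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # hypothetical
    eval_dataset=eval_dataset,    # hypothetical
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # patience is an assumption
)
trainer.train()
```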

## Results

Detailed performance metrics:

### YouTube Dataset
- Accuracy: 83.8%
- Precision: 86.5%
- Recall: 83.3%
- F1-score: 83.3%

### Telegram Dataset (cross-platform and multilingual)
- Accuracy: 71.9%
- Precision: 74.2%
- Recall: 71.9%
- F1-score: 71.2%

The model performs strongly on its primary training domain (English-language YouTube comments) and remains reasonably effective in the cross-platform, multilingual setting (Spanish- and Portuguese-language Telegram messages), indicating good generalization across social media environments.

## Example Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("erikbranmarino/CT-BERT-PRCT")
model = AutoModelForSequenceClassification.from_pretrained("erikbranmarino/CT-BERT-PRCT")
model.eval()  # inference mode: disables dropout

# Prepare your text
text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Get predicted class (0: Non-PRCT, 1: PRCT)
predicted_class = predictions.argmax().item()
confidence = predictions[0][predicted_class].item()

print(f"Class: {'PRCT' if predicted_class == 1 else 'Non-PRCT'}")
print(f"Confidence: {confidence:.2f}")
```
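
The same model can also be queried through the `pipeline` API. Since `config.json` does not define an `id2label` mapping, the pipeline reports the default `LABEL_0`/`LABEL_1` names, which correspond to Non-PRCT/PRCT:

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="erikbranmarino/CT-BERT-PRCT")

# LABEL_0 = Non-PRCT, LABEL_1 = PRCT
print(classifier("Your text here"))
```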

## Complete Example with Batch Processing

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    """Wraps a list of texts as pre-tokenized tensors for a DataLoader."""

    def __init__(self, texts, tokenizer, max_length=512):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=max_length)

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

def predict_batch(texts, model, tokenizer, batch_size=16):
    # Prepare dataset and dataloader
    dataset = TextDataset(texts, tokenizer)
    dataloader = DataLoader(dataset, batch_size=batch_size)

    predictions = []
    model.eval()

    with torch.no_grad():
        for batch in dataloader:
            outputs = model(**batch)
            probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predictions.extend(probs.cpu().numpy())

    return predictions

# Example usage (model and tokenizer loaded as in the previous example)
texts = ["text1", "text2", "text3"]  # Your list of texts
results = predict_batch(texts, model, tokenizer)

for text, pred in zip(texts, results):
    predicted_class = pred.argmax()
    confidence = pred[predicted_class]
    print(f"Text: {text[:50]}...")
    print(f"Class: {'PRCT' if predicted_class == 1 else 'Non-PRCT'}")
    print(f"Confidence: {confidence:.2f}\n")
```
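
On a GPU machine, inference is considerably faster if the model and each batch are moved to the device first. A minimal sketch of the inner loop of `predict_batch` adapted this way (assumes a `dataloader` built as above):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

predictions = []
with torch.no_grad():
    for batch in dataloader:
        # move every input tensor to the same device as the model
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predictions.extend(probs.cpu().numpy())
```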

## Bias and limitations

This model is intended for research and content-moderation purposes. It should be used as part of a broader content moderation strategy, not as the sole decision-maker for content removal (see the routing sketch after this list). The model may exhibit:
- Platform-specific biases due to the training data source
- Language-specific performance variations
- Sensitivity to evolving conspiracy narratives
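
One hypothetical way to honor that constraint is to act automatically only on high-confidence predictions and route the rest to human review (the 0.9 threshold below is illustrative, not a validated operating point):

```python
def route_prediction(probs, threshold=0.9):
    """Map one softmax output [p_non_prct, p_prct] to a moderation action."""
    predicted_class = int(probs[1] > probs[0])
    confidence = max(probs)
    if confidence < threshold:
        return "human_review"        # low confidence either way
    if predicted_class == 1:
        return "flag_for_moderator"  # high-confidence PRCT
    return "no_action"               # high-confidence Non-PRCT

print(route_prediction([0.05, 0.95]))  # flag_for_moderator
print(route_prediction([0.55, 0.45]))  # human_review
```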

## Citation

If you use this model, please cite:

```
@article{marino2024one,
  title={One Model to Detect Them All? Comparing LLMs, BERT and Traditional ML in Cross-Platform Conspiracy Detection},
  author={Marino, Erik Bran and Vieira, Renata and Bassi, Davide and Ribeiro, Ana Sofia and Baleato, Suso},
  year={2024}
}
```

## Contact

Erik Bran Marino (erik.marino@uevora.pt)
checkpoint-100/config.json ADDED
{
  "_name_or_path": "digitalepidemiologylab/covid-twitter-bert-v2",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.48.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
checkpoint-100/model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:b4e154db42681ad266530b961dcfeafaf1fb51bafb7f4c25376a578319eef6d9
size 1340622760
checkpoint-100/optimizer.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:53c8bd7c7688aecb7947fc7f1b1dc6e5884393163427eb8a8a1eea26ab91d46e
size 2681469421
checkpoint-100/rng_state.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:0a52e1687874240dc2f195388b3a14e21795a08a2e06360226802394635c650d
size 13990
checkpoint-100/scheduler.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:cffdc9546dda6f679ba0f40ea6b1378e5e22621f1578e0603b888c9a5ea82408
size 1064
checkpoint-100/trainer_state.json ADDED
{
  "best_metric": 0.9056603773584906,
  "best_model_checkpoint": "./ct-bert-finetuned-20250131_120923/checkpoint-100",
  "epoch": 2.5,
  "eval_steps": 100,
  "global_step": 100,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
  "log_history": [
    {
      "epoch": 1.25,
      "grad_norm": 8.891263008117676,
      "learning_rate": 1.6666666666666667e-05,
      "loss": 0.7518,
      "step": 50
    },
    {
      "epoch": 2.5,
      "grad_norm": 2.668881893157959,
      "learning_rate": 1.1111111111111113e-05,
      "loss": 0.4192,
      "step": 100
    },
    {
      "epoch": 2.5,
      "eval_accuracy": 0.90625,
      "eval_f1": 0.9056603773584906,
      "eval_loss": 0.28013086318969727,
      "eval_precision": 0.9113924050632911,
      "eval_recall": 0.9,
      "eval_runtime": 7.1504,
      "eval_samples_per_second": 89.505,
      "eval_steps_per_second": 2.797,
      "step": 100
    }
  ],
  "logging_steps": 50,
  "max_steps": 200,
  "num_input_tokens_seen": 0,
  "num_train_epochs": 5,
  "save_steps": 100,
  "stateful_callbacks": {
    "TrainerControl": {
      "args": {
        "should_epoch_stop": false,
        "should_evaluate": false,
        "should_log": false,
        "should_save": true,
        "should_training_stop": false
      },
      "attributes": {}
    }
  },
  "total_flos": 1490158249961472.0,
  "train_batch_size": 16,
  "trial_name": null,
  "trial_params": null
}
checkpoint-100/training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:05473e8cb95480e7b8c3824554730bb4dc46f837ec677135d77b6cb6e30f65e1
size 5304
checkpoint-200/config.json ADDED
{
  "_name_or_path": "digitalepidemiologylab/covid-twitter-bert-v2",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.48.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
checkpoint-200/model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:74310a36761f8d0b460e4f65110fdccff43327b32147ed43806cfb4da00c1f0d
size 1340622760
checkpoint-200/optimizer.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:205f183294405574d4a34acee4f15a9c122c88d037ed205829fd31233387a01b
size 2681469421
checkpoint-200/rng_state.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:313197ad728be17c0e8901a451d547f8b4bb02ccc62f005a6369fd5fcbb66fad
size 13990
checkpoint-200/scheduler.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:a3c4f2daa2ec8a75526e136b8d019c39aa6267dcee395737b739960392e13892
size 1064
checkpoint-200/trainer_state.json ADDED
{
  "best_metric": 0.9198184568835098,
  "best_model_checkpoint": "./ct-bert-finetuned-20250131_120923/checkpoint-200",
  "epoch": 5.0,
  "eval_steps": 100,
  "global_step": 200,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
  "log_history": [
    {
      "epoch": 1.25,
      "grad_norm": 8.891263008117676,
      "learning_rate": 1.6666666666666667e-05,
      "loss": 0.7518,
      "step": 50
    },
    {
      "epoch": 2.5,
      "grad_norm": 2.668881893157959,
      "learning_rate": 1.1111111111111113e-05,
      "loss": 0.4192,
      "step": 100
    },
    {
      "epoch": 2.5,
      "eval_accuracy": 0.90625,
      "eval_f1": 0.9056603773584906,
      "eval_loss": 0.28013086318969727,
      "eval_precision": 0.9113924050632911,
      "eval_recall": 0.9,
      "eval_runtime": 7.1504,
      "eval_samples_per_second": 89.505,
      "eval_steps_per_second": 2.797,
      "step": 100
    },
    {
      "epoch": 3.75,
      "grad_norm": 3.1482038497924805,
      "learning_rate": 5.555555555555557e-06,
      "loss": 0.2009,
      "step": 150
    },
    {
      "epoch": 5.0,
      "grad_norm": 3.4415204524993896,
      "learning_rate": 0.0,
      "loss": 0.0996,
      "step": 200
    },
    {
      "epoch": 5.0,
      "eval_accuracy": 0.9171875,
      "eval_f1": 0.9198184568835098,
      "eval_loss": 0.2471379041671753,
      "eval_precision": 0.8914956011730205,
      "eval_recall": 0.95,
      "eval_runtime": 7.131,
      "eval_samples_per_second": 89.75,
      "eval_steps_per_second": 2.805,
      "step": 200
    }
  ],
  "logging_steps": 50,
  "max_steps": 200,
  "num_input_tokens_seen": 0,
  "num_train_epochs": 5,
  "save_steps": 100,
  "stateful_callbacks": {
    "TrainerControl": {
      "args": {
        "should_epoch_stop": false,
        "should_evaluate": false,
        "should_log": false,
        "should_save": true,
        "should_training_stop": true
      },
      "attributes": {}
    }
  },
  "total_flos": 2979850534241280.0,
  "train_batch_size": 16,
  "trial_name": null,
  "trial_params": null
}
checkpoint-200/training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:05473e8cb95480e7b8c3824554730bb4dc46f837ec677135d77b6cb6e30f65e1
size 5304
config.json ADDED
{
  "_name_or_path": "digitalepidemiologylab/covid-twitter-bert-v2",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.48.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
final_metrics.txt ADDED
test_accuracy: 0.8380
test_precision: 0.7747
test_recall: 0.9691
test_f1: 0.8611
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:74310a36761f8d0b460e4f65110fdccff43327b32147ed43806cfb4da00c1f0d
size 1340622760
special_tokens_map.json ADDED
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "extra_special_tokens": {},
  "full_tokenizer_file": null,
  "mask_token": "[MASK]",
  "model_max_length": 1000000000000000019884624838656,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:05473e8cb95480e7b8c3824554730bb4dc46f837ec677135d77b6cb6e30f65e1
size 5304
upload-script.py ADDED
import os
import sys

from huggingface_hub import HfApi, create_repo


def check_files_exist(model_path):
    """Check if all necessary files exist in the model directory."""
    required_files = [
        "config.json",
        "model.safetensors",
        "special_tokens_map.json",
        "tokenizer.json",
        "tokenizer_config.json",
        "training_args.bin",
        "vocab.txt",
        "README.md"
    ]

    missing_files = []
    for file in required_files:
        if not os.path.exists(os.path.join(model_path, file)):
            missing_files.append(file)

    return missing_files


def upload_model_to_hf(model_path, repo_name, organization=None):
    """
    Upload a model to Hugging Face Hub.

    Args:
        model_path (str): Path to the model directory
        repo_name (str): Name for the repository on Hugging Face
        organization (str, optional): Organization name if uploading to an organization
    """
    try:
        # Initialize the API
        api = HfApi()

        # Check if all required files exist
        missing_files = check_files_exist(model_path)
        if missing_files:
            print(f"Error: Missing required files: {', '.join(missing_files)}")
            return False

        # Create full repository name
        if organization:
            full_repo_name = f"{organization}/{repo_name}"
        else:
            full_repo_name = f"{api.whoami()['name']}/{repo_name}"

        print(f"Creating repository: {full_repo_name}")

        # Create the repository
        try:
            create_repo(
                repo_id=full_repo_name,
                private=False,
                exist_ok=True
            )
        except Exception as e:
            print(f"Error creating repository: {str(e)}")
            return False

        print("Repository created successfully!")

        # Upload the model files
        print(f"Uploading files from {model_path}")
        api.upload_folder(
            folder_path=model_path,
            repo_id=full_repo_name,
            repo_type="model"
        )

        print("Upload completed successfully!")
        print(f"Your model is now available at: https://huggingface.co/{full_repo_name}")
        return True

    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return False


if __name__ == "__main__":
    # Configuration
    MODEL_PATH = "/Users/erikbranmarino/BERT-PRCT-fine-tuning/ct-bert-finetuned-20250131_120923"  # Path to your model
    REPO_NAME = "CT-BERT-PRCT"  # Name to give the repository

    # Verify that the path exists
    if not os.path.exists(MODEL_PATH):
        print(f"Error: Model path {MODEL_PATH} does not exist!")
        sys.exit(1)

    # Run the upload
    success = upload_model_to_hf(MODEL_PATH, REPO_NAME)

    if not success:
        sys.exit(1)
vocab.txt ADDED
The diff for this file is too large to render. See raw diff