erikbranmarino committed on
Commit b47e2cf · verified · 1 Parent(s): df13e93

Upload folder using huggingface_hub

README.md ADDED
---
language: multilingual
tags:
- conspiracy-detection
- content-moderation
- bert
- prct
- social-media
license: mit
datasets:
- custom
metrics:
- accuracy
- f1
- precision
- recall
---

# CT-BERT-PRCT

## Model description

CT-BERT-PRCT is a fine-tuned version of CT-BERT (COVID-Twitter-BERT v2) adapted for detecting Population Replacement Conspiracy Theory (PRCT) content across social media platforms. The model was trained to identify both explicit and implicit PRCT narratives while maintaining robust cross-platform generalization.

## Intended uses & limitations

### Intended uses

- Content moderation for social media platforms
- Research on conspiracy theory propagation
- Cross-platform conspiracy content detection
- Multilingual PRCT detection

### Limitations

- Performance may vary across social media platforms
- May require periodic fine-tuning to adapt to evolving narratives
- Should be used as part of a broader content moderation strategy
- Performs best on YouTube content, with some degradation on other platforms

## Training and evaluation data

The model was fine-tuned on a dataset of 56,085 YouTube comments and evaluated on:
- A manually annotated gold standard of 500 YouTube comments
- A cross-platform test set of 160 Telegram messages in Spanish and Portuguese
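
For reference, a minimal sketch of how the reported metrics could be scored against such a gold standard with scikit-learn (the labels below are hypothetical placeholders, and the averaging method is an assumption, as the model card does not state it):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# 0 = Non-PRCT, 1 = PRCT; hypothetical gold labels and model predictions
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"  # averaging choice is an assumption
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```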

## Training procedure

The model was fine-tuned with the following configuration (sketched in code below):
- Learning rate: 2e-5
- Batch size: 32
- Maximum epochs: 6
- Early stopping based on validation performance
- Base model: CT-BERT (`digitalepidemiologylab/covid-twitter-bert-v2`, pre-trained on COVID-19-related Twitter content)
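
A minimal sketch of a comparable setup with the Hugging Face `Trainer` (not the exact training script; `train_dataset` and `eval_dataset` are hypothetical tokenized datasets with a `labels` column, and the early-stopping patience is an assumption):

```python
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "digitalepidemiologylab/covid-twitter-bert-v2", num_labels=2
)

args = TrainingArguments(
    output_dir="ct-bert-prct",
    learning_rate=2e-5,              # as listed above
    per_device_train_batch_size=32,  # as listed above
    num_train_epochs=6,              # maximum epochs
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # hypothetical
    eval_dataset=eval_dataset,    # hypothetical
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # patience is an assumption
)
trainer.train()
```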

## Results

Detailed performance metrics:

### YouTube Dataset
- Accuracy: 83.8%
- Precision: 86.5%
- Recall: 83.3%
- F1-score: 83.3%

### Telegram Dataset (cross-platform and multilingual)
- Accuracy: 71.9%
- Precision: 74.2%
- Recall: 71.9%
- F1-score: 71.2%

The model performs strongly on its primary training domain (English-language YouTube comments) and remains reasonably effective in the cross-platform, multilingual setting (Spanish- and Portuguese-language Telegram messages), indicating good generalization across social media environments.

## Example Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("erikbranmarino/CT-BERT-PRCT")
model = AutoModelForSequenceClassification.from_pretrained("erikbranmarino/CT-BERT-PRCT")
model.eval()  # inference mode: disables dropout

# Prepare your text
text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Get predicted class (0: Non-PRCT, 1: PRCT)
predicted_class = predictions.argmax().item()
confidence = predictions[0][predicted_class].item()

print(f"Class: {'PRCT' if predicted_class == 1 else 'Non-PRCT'}")
print(f"Confidence: {confidence:.2f}")
```
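
The same model can also be queried through the `pipeline` API. Since `config.json` does not define an `id2label` mapping, the pipeline reports the default `LABEL_0`/`LABEL_1` names, which correspond to Non-PRCT/PRCT:

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="erikbranmarino/CT-BERT-PRCT")

# LABEL_0 = Non-PRCT, LABEL_1 = PRCT
print(classifier("Your text here"))
```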

## Complete Example with Batch Processing

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    """Wraps a list of texts as pre-tokenized tensors for a DataLoader."""

    def __init__(self, texts, tokenizer, max_length=512):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=max_length)

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

def predict_batch(texts, model, tokenizer, batch_size=16):
    # Prepare dataset and dataloader
    dataset = TextDataset(texts, tokenizer)
    dataloader = DataLoader(dataset, batch_size=batch_size)

    predictions = []
    model.eval()

    with torch.no_grad():
        for batch in dataloader:
            outputs = model(**batch)
            probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predictions.extend(probs.cpu().numpy())

    return predictions

# Example usage (model and tokenizer loaded as in the previous example)
texts = ["text1", "text2", "text3"]  # Your list of texts
results = predict_batch(texts, model, tokenizer)

for text, pred in zip(texts, results):
    predicted_class = pred.argmax()
    confidence = pred[predicted_class]
    print(f"Text: {text[:50]}...")
    print(f"Class: {'PRCT' if predicted_class == 1 else 'Non-PRCT'}")
    print(f"Confidence: {confidence:.2f}\n")
```
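
On a GPU machine, inference is considerably faster if the model and each batch are moved to the device first. A minimal sketch of the inner loop of `predict_batch` adapted this way (assumes a `dataloader` built as above):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

predictions = []
with torch.no_grad():
    for batch in dataloader:
        # move every input tensor to the same device as the model
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predictions.extend(probs.cpu().numpy())
```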

## Bias and limitations

This model is intended for research and content-moderation purposes. It should be used as part of a broader content moderation strategy, not as the sole decision-maker for content removal (see the routing sketch after this list). The model may exhibit:
- Platform-specific biases due to the training data source
- Language-specific performance variations
- Sensitivity to evolving conspiracy narratives
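
One hypothetical way to honor that constraint is to act automatically only on high-confidence predictions and route the rest to human review (the 0.9 threshold below is illustrative, not a validated operating point):

```python
def route_prediction(probs, threshold=0.9):
    """Map one softmax output [p_non_prct, p_prct] to a moderation action."""
    predicted_class = int(probs[1] > probs[0])
    confidence = max(probs)
    if confidence < threshold:
        return "human_review"        # low confidence either way
    if predicted_class == 1:
        return "flag_for_moderator"  # high-confidence PRCT
    return "no_action"               # high-confidence Non-PRCT

print(route_prediction([0.05, 0.95]))  # flag_for_moderator
print(route_prediction([0.55, 0.45]))  # human_review
```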

## Citation

If you use this model, please cite:

```
@article{marino2024one,
  title={One Model to Detect Them All? Comparing LLMs, BERT and Traditional ML in Cross-Platform Conspiracy Detection},
  author={Marino, Erik Bran and Vieira, Renata and Bassi, Davide and Ribeiro, Ana Sofia and Baleato, Suso},
  year={2024}
}
```

## Contact

Erik Bran Marino (erik.marino@uevora.pt)
checkpoint-100/config.json ADDED
{
  "_name_or_path": "digitalepidemiologylab/covid-twitter-bert-v2",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.48.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
checkpoint-100/model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:b4e154db42681ad266530b961dcfeafaf1fb51bafb7f4c25376a578319eef6d9
size 1340622760
checkpoint-100/optimizer.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:53c8bd7c7688aecb7947fc7f1b1dc6e5884393163427eb8a8a1eea26ab91d46e
size 2681469421
checkpoint-100/rng_state.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:0a52e1687874240dc2f195388b3a14e21795a08a2e06360226802394635c650d
size 13990
checkpoint-100/scheduler.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:cffdc9546dda6f679ba0f40ea6b1378e5e22621f1578e0603b888c9a5ea82408
size 1064
checkpoint-100/trainer_state.json ADDED
{
  "best_metric": 0.9056603773584906,
  "best_model_checkpoint": "./ct-bert-finetuned-20250131_120923/checkpoint-100",
  "epoch": 2.5,
  "eval_steps": 100,
  "global_step": 100,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
  "log_history": [
    {
      "epoch": 1.25,
      "grad_norm": 8.891263008117676,
      "learning_rate": 1.6666666666666667e-05,
      "loss": 0.7518,
      "step": 50
    },
    {
      "epoch": 2.5,
      "grad_norm": 2.668881893157959,
      "learning_rate": 1.1111111111111113e-05,
      "loss": 0.4192,
      "step": 100
    },
    {
      "epoch": 2.5,
      "eval_accuracy": 0.90625,
      "eval_f1": 0.9056603773584906,
      "eval_loss": 0.28013086318969727,
      "eval_precision": 0.9113924050632911,
      "eval_recall": 0.9,
      "eval_runtime": 7.1504,
      "eval_samples_per_second": 89.505,
      "eval_steps_per_second": 2.797,
      "step": 100
    }
  ],
  "logging_steps": 50,
  "max_steps": 200,
  "num_input_tokens_seen": 0,
  "num_train_epochs": 5,
  "save_steps": 100,
  "stateful_callbacks": {
    "TrainerControl": {
      "args": {
        "should_epoch_stop": false,
        "should_evaluate": false,
        "should_log": false,
        "should_save": true,
        "should_training_stop": false
      },
      "attributes": {}
    }
  },
  "total_flos": 1490158249961472.0,
  "train_batch_size": 16,
  "trial_name": null,
  "trial_params": null
}
checkpoint-100/training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:05473e8cb95480e7b8c3824554730bb4dc46f837ec677135d77b6cb6e30f65e1
size 5304
checkpoint-200/config.json ADDED
{
  "_name_or_path": "digitalepidemiologylab/covid-twitter-bert-v2",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.48.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
checkpoint-200/model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:74310a36761f8d0b460e4f65110fdccff43327b32147ed43806cfb4da00c1f0d
size 1340622760
checkpoint-200/optimizer.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:205f183294405574d4a34acee4f15a9c122c88d037ed205829fd31233387a01b
size 2681469421
checkpoint-200/rng_state.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:313197ad728be17c0e8901a451d547f8b4bb02ccc62f005a6369fd5fcbb66fad
size 13990
checkpoint-200/scheduler.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:a3c4f2daa2ec8a75526e136b8d019c39aa6267dcee395737b739960392e13892
size 1064
checkpoint-200/trainer_state.json ADDED
{
  "best_metric": 0.9198184568835098,
  "best_model_checkpoint": "./ct-bert-finetuned-20250131_120923/checkpoint-200",
  "epoch": 5.0,
  "eval_steps": 100,
  "global_step": 200,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
  "log_history": [
    {
      "epoch": 1.25,
      "grad_norm": 8.891263008117676,
      "learning_rate": 1.6666666666666667e-05,
      "loss": 0.7518,
      "step": 50
    },
    {
      "epoch": 2.5,
      "grad_norm": 2.668881893157959,
      "learning_rate": 1.1111111111111113e-05,
      "loss": 0.4192,
      "step": 100
    },
    {
      "epoch": 2.5,
      "eval_accuracy": 0.90625,
      "eval_f1": 0.9056603773584906,
      "eval_loss": 0.28013086318969727,
      "eval_precision": 0.9113924050632911,
      "eval_recall": 0.9,
      "eval_runtime": 7.1504,
      "eval_samples_per_second": 89.505,
      "eval_steps_per_second": 2.797,
      "step": 100
    },
    {
      "epoch": 3.75,
      "grad_norm": 3.1482038497924805,
      "learning_rate": 5.555555555555557e-06,
      "loss": 0.2009,
      "step": 150
    },
    {
      "epoch": 5.0,
      "grad_norm": 3.4415204524993896,
      "learning_rate": 0.0,
      "loss": 0.0996,
      "step": 200
    },
    {
      "epoch": 5.0,
      "eval_accuracy": 0.9171875,
      "eval_f1": 0.9198184568835098,
      "eval_loss": 0.2471379041671753,
      "eval_precision": 0.8914956011730205,
      "eval_recall": 0.95,
      "eval_runtime": 7.131,
      "eval_samples_per_second": 89.75,
      "eval_steps_per_second": 2.805,
      "step": 200
    }
  ],
  "logging_steps": 50,
  "max_steps": 200,
  "num_input_tokens_seen": 0,
  "num_train_epochs": 5,
  "save_steps": 100,
  "stateful_callbacks": {
    "TrainerControl": {
      "args": {
        "should_epoch_stop": false,
        "should_evaluate": false,
        "should_log": false,
        "should_save": true,
        "should_training_stop": true
      },
      "attributes": {}
    }
  },
  "total_flos": 2979850534241280.0,
  "train_batch_size": 16,
  "trial_name": null,
  "trial_params": null
}
checkpoint-200/training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:05473e8cb95480e7b8c3824554730bb4dc46f837ec677135d77b6cb6e30f65e1
size 5304
config.json ADDED
{
  "_name_or_path": "digitalepidemiologylab/covid-twitter-bert-v2",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.48.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
final_metrics.txt ADDED
test_accuracy: 0.8380
test_precision: 0.7747
test_recall: 0.9691
test_f1: 0.8611
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:74310a36761f8d0b460e4f65110fdccff43327b32147ed43806cfb4da00c1f0d
size 1340622760
special_tokens_map.json ADDED
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "extra_special_tokens": {},
  "full_tokenizer_file": null,
  "mask_token": "[MASK]",
  "model_max_length": 1000000000000000019884624838656,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:05473e8cb95480e7b8c3824554730bb4dc46f837ec677135d77b6cb6e30f65e1
size 5304
upload-script.py ADDED
import os
import sys

from huggingface_hub import HfApi, create_repo


def check_files_exist(model_path):
    """Check if all necessary files exist in the model directory."""
    required_files = [
        "config.json",
        "model.safetensors",
        "special_tokens_map.json",
        "tokenizer.json",
        "tokenizer_config.json",
        "training_args.bin",
        "vocab.txt",
        "README.md"
    ]

    missing_files = []
    for file in required_files:
        if not os.path.exists(os.path.join(model_path, file)):
            missing_files.append(file)

    return missing_files


def upload_model_to_hf(model_path, repo_name, organization=None):
    """
    Upload a model to Hugging Face Hub.

    Args:
        model_path (str): Path to the model directory
        repo_name (str): Name for the repository on Hugging Face
        organization (str, optional): Organization name if uploading to an organization
    """
    try:
        # Initialize the API
        api = HfApi()

        # Check if all required files exist
        missing_files = check_files_exist(model_path)
        if missing_files:
            print(f"Error: Missing required files: {', '.join(missing_files)}")
            return False

        # Create full repository name
        if organization:
            full_repo_name = f"{organization}/{repo_name}"
        else:
            full_repo_name = f"{api.whoami()['name']}/{repo_name}"

        print(f"Creating repository: {full_repo_name}")

        # Create the repository
        try:
            create_repo(
                repo_id=full_repo_name,
                private=False,
                exist_ok=True
            )
        except Exception as e:
            print(f"Error creating repository: {str(e)}")
            return False

        print("Repository created successfully!")

        # Upload the model files
        print(f"Uploading files from {model_path}")
        api.upload_folder(
            folder_path=model_path,
            repo_id=full_repo_name,
            repo_type="model"
        )

        print("Upload completed successfully!")
        print(f"Your model is now available at: https://huggingface.co/{full_repo_name}")
        return True

    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return False


if __name__ == "__main__":
    # Configuration
    MODEL_PATH = "/Users/erikbranmarino/BERT-PRCT-fine-tuning/ct-bert-finetuned-20250131_120923"  # Path to your model
    REPO_NAME = "CT-BERT-PRCT"  # Name to give the repository

    # Verify that the path exists
    if not os.path.exists(MODEL_PATH):
        print(f"Error: Model path {MODEL_PATH} does not exist!")
        sys.exit(1)

    # Run the upload
    success = upload_model_to_hf(MODEL_PATH, REPO_NAME)

    if not success:
        sys.exit(1)
vocab.txt ADDED
The diff for this file is too large to render. See raw diff