## Dataset

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6340b0bd77fd972573eb2f9b/pO-v6961YI1D0IPcm0027.png)
## Load Model

SecureBERT+ is available on the [Hugging Face Hub](https://huggingface.co/ehsanaghaei/SecureBERT_Plus) and can be loaded with the `transformers` library:

```python
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT_Plus")

inputs = tokenizer("This is SecureBERT Plus!", return_tensors="pt")
outputs = model(**inputs)

# Per-token representations: shape (batch, sequence_length, hidden_size)
last_hidden_states = outputs.last_hidden_state
```
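`last_hidden_state` gives one vector per token; a common way to get a single fixed-size sentence embedding is to mean-pool it over the non-padding tokens. A minimal sketch of that pooling, assuming the model's 768-dimensional hidden size (the `mean_pool` helper is ours, not part of SecureBERT+, and dummy tensors stand in for real model output):

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then average the remaining token vectors
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # (batch, 1)
    return summed / counts                          # (batch, hidden)

# Dummy tensors in place of tokenizer/model output (hidden size 768)
hidden = torch.randn(2, 6, 768)
mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1, 1]])
emb = mean_pool(hidden, mask)
print(emb.shape)  # torch.Size([2, 768])
```

In practice you would pass `outputs.last_hidden_state` and `inputs["attention_mask"]` from the snippet above.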

## Fill Mask (MLM)

Use the code below to predict the masked word within a given sentence (mark the target word with RoBERTa's `<mask>` token):

```python
# pip install torch transformers tokenizers

import torch
from transformers import RobertaTokenizerFast, RobertaForMaskedLM

tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT_Plus")

def predict_mask(sent, tokenizer, model, topk=10, print_results=True):
    """Return the top-k predicted words for each <mask> token in `sent`."""
    token_ids = tokenizer.encode(sent, return_tensors='pt')

    # Positions of every <mask> token in the encoded input
    masked_positions = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
    masked_pos = [mask.item() for mask in masked_positions]

    with torch.no_grad():
        output = model(token_ids)
    last_hidden_state = output[0].squeeze()

    list_of_list = []
    for mask_index in masked_pos:
        # Vocabulary scores at this mask position; keep the top-k indices
        mask_hidden_state = last_hidden_state[mask_index]
        idx = torch.topk(mask_hidden_state, k=topk, dim=0)[1]
        words = [tokenizer.decode(i.item()).strip().replace(' ', '') for i in idx]
        list_of_list.append(words)
        if print_results:
            print("Mask predictions:", words)

    # One list of candidate words per <mask> token
    return list_of_list

while True:
    sent = input("Text here: \t")
    print("SecureBERT: ")
    predict_mask(sent, tokenizer, model)
    print("===========================\n")
```
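The core of `predict_mask` is a top-k lookup over the vocabulary scores at each mask position. That selection step can be seen in isolation on a toy tensor (the five-entry "vocabulary" below is an illustrative assumption, not real model output):

```python
import torch

# Dummy per-vocabulary scores at one mask position (5-word toy vocabulary)
vocab_logits = torch.tensor([0.1, 2.5, 0.3, 1.7, 0.05])

# torch.topk returns (values, indices); [1] selects the indices,
# exactly as predict_mask does before decoding them back to words
top2 = torch.topk(vocab_logits, k=2, dim=0)[1]
print(top2.tolist())  # [1, 3]
```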

Other model variants:

[SecureGPT](https://huggingface.co/ehsanaghaei/SecureGPT)

[SecureBERT](https://huggingface.co/ehsanaghaei/SecureBERT)

# Reference

```bibtex
@inproceedings{aghaei2023securebert,
  title={SecureBERT: A Domain-Specific Language Model for Cybersecurity},
  author={Aghaei, Ehsan and Niu, Xi and Shadid, Waseem and Al-Shaer, Ehab},
  booktitle={Security and Privacy in Communication Networks:
             18th EAI International Conference, SecureComm 2022,
             Virtual Event, October 2022, Proceedings},
  pages={39--56},
  year={2023},
  organization={Springer}
}
```