---
license: cc-by-nc-4.0
language:
- en
tags:
- cybersecurity
widget:
- text: "Native API functions such as <mask>, may be directed invoked via system calls/syscalls, but these features are also often exposed to user-mode applications via interfaces and libraries.."
example_title: Native API functions
- text: "One way of explicitly assigning the PPID of a new process is via the <mask> API call, which supports a parameter that defines the PPID to use."
example_title: Assigning the PPID of a new process
- text: "Enable Safe DLL Search Mode to force search for system DLLs in directories with greater restrictions (e.g. %<mask>%) to be used before local directory DLLs (e.g. a user's home directory)"
example_title: Enable Safe DLL Search Mode
- text: "GuLoader is a file downloader that has been used since at least December 2019 to distribute a variety of <mask>, including NETWIRE, Agent Tesla, NanoCore, and FormBook."
example_title: GuLoader is a file downloader
---
# SecureBERT+
This model is an improved version of the [SecureBERT](https://huggingface.co/ehsanaghaei/SecureBERT) model, trained on a corpus eight times larger than its predecessor's using 8xA100 GPUs. This version, known as SecureBERT+, achieves an average improvement of 9% on the Masked Language Model (MLM) task, a substantial step toward stronger language understanding and representation learning in the cybersecurity domain.

SecureBERT is a domain-specific language model based on RoBERTa, trained on a large corpus of cybersecurity data and fine-tuned to understand and represent cybersecurity text.
## Dataset
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6340b0bd77fd972573eb2f9b/pO-v6961YI1D0IPcm0027.png)
## Load Model
SecureBERT+ is available on [Hugging Face](https://huggingface.co/ehsanaghaei/SecureBERT_Plus).
```python
from transformers import RobertaTokenizer, RobertaModel
import torch

tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT_Plus")

# Encode a sample sentence and extract its contextual token embeddings.
inputs = tokenizer("This is SecureBERT Plus!", return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state  # shape: (batch, seq_len, hidden_size)
```
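If you need one fixed-size vector per sentence (e.g. for clustering or similarity search over threat reports), a common recipe is to mean-pool `last_hidden_state` over the attention mask. The sketch below is illustrative and not part of the official SecureBERT+ API; the `embed` helper and the example sentences are assumptions.

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT_Plus")

def embed(texts):
    # Hypothetical helper: tokenize a batch, padding to the longest sentence.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state       # (batch, seq, hidden)
    # Zero out padding positions before averaging.
    mask = batch["attention_mask"].unsqueeze(-1)        # (batch, seq, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1) # (batch, hidden)

vecs = embed(["Adversaries may inject code into processes.",
              "The attacker exfiltrated data over DNS."])
print(vecs.shape)  # torch.Size([2, 768])
```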
## Fill Mask (MLM)
Use the code below to predict the masked word within the given sentences:
```python
#!pip install transformers torch tokenizers

import torch
import transformers
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = transformers.RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT_Plus")

def predict_mask(sent, tokenizer, model, topk=10, print_results=True):
    """Predict the top-k candidate tokens for every <mask> in `sent`."""
    token_ids = tokenizer.encode(sent, return_tensors="pt")
    # Positions of all <mask> tokens in the input.
    masked_positions = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
    masked_pos = [mask.item() for mask in masked_positions]

    with torch.no_grad():
        # For a masked-LM head, output[0] holds the vocabulary logits.
        logits = model(token_ids)[0].squeeze()

    list_of_list = []
    for index, mask_index in enumerate(masked_pos):
        # Top-k token ids for this mask position, decoded to strings.
        idx = torch.topk(logits[mask_index], k=topk, dim=0)[1]
        words = [tokenizer.decode(i.item()).strip().replace(" ", "") for i in idx]
        list_of_list.append(words)
        if print_results:
            print("Mask", index, "Predictions:", words)

    return list_of_list

while True:
    sent = input("Text here: \t")
    print("SecureBERT: ")
    predict_mask(sent, tokenizer, model)
    print("===========================\n")
```
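Alternatively, the same task can be run through the standard `transformers` fill-mask pipeline, which handles tokenization, mask lookup, and decoding internally. This is a minimal sketch, not the authors' published workflow; the example sentence is adapted from the widget prompts above.

```python
from transformers import pipeline

# Load SecureBERT+ into the generic fill-mask pipeline.
fill = pipeline("fill-mask", model="ehsanaghaei/SecureBERT_Plus")

# RoBERTa-style models use <mask> as the mask token.
for pred in fill("GuLoader is a file downloader used to distribute a variety of <mask>."):
    print(f"{pred['token_str'].strip()}\t{pred['score']:.3f}")
```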
Other model variants:
- [SecureGPT](https://huggingface.co/ehsanaghaei/SecureGPT)
- [SecureDeBERTa](https://huggingface.co/ehsanaghaei/SecureDeBERTa)
- [SecureBERT](https://huggingface.co/ehsanaghaei/SecureBERT)
# Reference
```bibtex
@inproceedings{aghaei2023securebert,
  title={SecureBERT: A Domain-Specific Language Model for Cybersecurity},
  author={Aghaei, Ehsan and Niu, Xi and Shadid, Waseem and Al-Shaer, Ehab},
  booktitle={Security and Privacy in Communication Networks: 18th EAI International Conference, SecureComm 2022, Virtual Event, October 2022, Proceedings},
  pages={39--56},
  year={2023},
  organization={Springer}
}
```