---
license: mit
---
This is a RoBERTa model trained on program source code, specifically the WolfSSL code base.
The training code is written in C/C++, but inference can also be run on code in other languages.
The model can be used to unmask tokens in the following way:
```python
from transformers import pipeline
unmasker = pipeline('fill-mask', model='mstaron/wolfBERTa')
unmasker("Hello I'm a <mask> model.")
```
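The `fill-mask` pipeline returns a list of candidate dictionaries, each carrying `score`, `token_str`, and `sequence` keys. A minimal sketch of picking the most likely candidate follows; the helper name `best_candidate` and the sample predictions are hypothetical, made up for illustration, and are not actual output from wolfBERTa.

```python
def best_candidate(predictions, min_score=0.0):
    """Return the highest-scoring candidate dict, or None if none reaches min_score."""
    eligible = [p for p in predictions if p["score"] >= min_score]
    if not eligible:
        return None
    return max(eligible, key=lambda p: p["score"])


# made-up predictions in the shape the fill-mask pipeline returns
# (illustrative values only, not real model output)
sample = [
    {"score": 0.10, "token_str": "1", "sequence": "int ret = 1;"},
    {"score": 0.70, "token_str": "0", "sequence": "int ret = 0;"},
    {"score": 0.05, "token_str": "2", "sequence": "int ret = 2;"},
]

top = best_candidate(sample)
print(top["sequence"])
```

With real output, `sample` would be replaced by the result of calling `unmasker(...)` on a string that contains exactly one `<mask>` token.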
Embeddings for a downstream task can be obtained in the following way:
```python
# import the model via the huggingface library
from transformers import AutoTokenizer, AutoModelForMaskedLM
# load the tokenizer and the model for the pretrained wolfBERTa
tokenizer = AutoTokenizer.from_pretrained('mstaron/wolfBERTa')
# load the model
model = AutoModelForMaskedLM.from_pretrained("mstaron/wolfBERTa")
# import the feature extraction pipeline
from transformers import pipeline
# create the pipeline, which will extract the embedding vectors
# the models are already pre-defined, so we do not need to train anything here
features = pipeline(
    "feature-extraction",
    model=model,
    tokenizer=tokenizer,
    return_tensors=False  # return plain Python lists instead of tensors
)
# extract the features == embeddings
lstFeatures = features('Class HTTP::X1')
# print the embedding of the first token, <s> (RoBERTa's counterpart of [CLS]),
# which is often used as an approximation of the whole-sequence embedding;
# an alternative is to average over all tokens with np.mean(lstFeatures[0], axis=0)
lstFeatures[0][0]
```
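The feature-extraction pipeline returns nested lists shaped `[input][token][dimension]`, so taking the first-token vector and averaging over tokens are two different ways to pool one vector per input. A small pure-Python sketch of both follows; the function names and the toy values are illustrative assumptions, not part of the model or the `transformers` API.

```python
def first_token_embedding(features):
    """Vector of the first token (<s> for RoBERTa tokenizers) of the first input."""
    return list(features[0][0])


def mean_embedding(features):
    """Element-wise average over all token vectors of the first input."""
    tokens = features[0]
    # zip(*tokens) groups the values of each dimension across tokens
    return [sum(column) / len(tokens) for column in zip(*tokens)]


# toy stand-in for pipeline output: 1 input, 3 tokens, 2 dimensions
toy = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]

print(first_token_embedding(toy))  # [1.0, 2.0]
print(mean_embedding(toy))         # [3.0, 4.0]
```

Either pooled vector can be passed to a downstream classifier; which one works better is an empirical question for the task at hand.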
To use the model for a downstream task, it first needs to be trained (fine-tuned) on that task.