Leaky Model
This is a simple LSTM-based text generation model, designed to illustrate how models can leak sensitive data.
- The raw training data consists of penetration testing reports (in PDF format) from prior competition events. The original source files are available in the CPTC Report Examples repository.
- The codebase used to process the data and train this model is in the CPTC leaky_model repository.
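One way to observe this kind of leakage is to check whether generated text reproduces long word sequences verbatim from a known document. The sketch below is a minimal illustration only: the helper names `ngrams` and `shared_ngrams` and both sample strings are hypothetical, and none of it is taken from the leaky_model codebase or the CPTC reports.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(generated, reference, n=5):
    """Return the n-grams that occur in both texts (hypothetical helper)."""
    return ngrams(generated, n) & ngrams(reference, n)


# Made-up strings for illustration only.
reference = "the admin portal accepted the default password on the staging host"
generated = "we found that the admin portal accepted the default password during testing"

overlap = shared_ngrams(generated, reference, n=5)
for gram in sorted(overlap):
    print(" ".join(gram))
```

A larger `n` makes accidental matches less likely, so long shared n-grams are a stronger hint that the model memorized the passage rather than produced it by chance.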
This model repository contains the following files:
- text_generation_model.keras: trained LSTM (Long Short-Term Memory) neural network model saved in Keras format
- text_processor.pkl: a pickled (serialized) TextProcessor object containing:
  - A fitted tokenizer with the vocabulary from the training data
  - Sequence length configuration (default 50 tokens)
  - Vocabulary size information
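The sequence length sets how many token IDs the model sees at once. A minimal sketch of fitting a token list to that window, assuming left-padding with 0 and keeping only the most recent tokens when the input is too long; `fit_to_window` is a hypothetical helper, not part of the TextProcessor class.

```python
def fit_to_window(token_ids, sequence_length=50, pad_id=0):
    """Left-pad or left-truncate token_ids to exactly sequence_length items."""
    recent = token_ids[-sequence_length:]  # keep the most recent tokens
    padding = [pad_id] * (sequence_length - len(recent))
    return padding + recent


print(fit_to_window([7, 3, 9], sequence_length=5))           # short input is padded
print(fit_to_window([1, 2, 3, 4, 5, 6], sequence_length=5))  # long input is truncated
```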
Usage
import tensorflow as tf
import pickle
import numpy as np
model_file = "text_generation_model.keras"
processor_file = "text_processor.pkl"
# Load model and processor
model = tf.keras.models.load_model(model_file)
with open(processor_file, 'rb') as f:
    processor = pickle.load(f)
# Generation parameters
prompt = "Once upon a time"
max_tokens = 100
temperature = 1.7 # Higher = more random, Lower = more focused (default: 0.7)
top_k = 50 # Limit to top k tokens (set to 0 to disable)
top_p = 0.9 # Nucleus sampling threshold (set to 1.0 to disable)
# Process the prompt
tokenizer = processor['tokenizer']
sequence_length = processor['sequence_length']
current_sequence = tokenizer.texts_to_sequences([prompt])[0]
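current_sequence = current_sequence[-sequence_length:]  # keep only the most recent tokens if the prompt is too long
current_sequence = [0] * (sequence_length - len(current_sequence)) + current_sequence  # left-pad with zeros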
current_sequence = np.array([current_sequence])
# Generate text
generated_text = prompt
# Map token indices back to words (built once, before the loop)
index_to_word = {index: word for word, index in tokenizer.word_index.items()}
for _ in range(max_tokens):
    pred = model.predict(current_sequence, verbose=0)
    logits = pred[0] / temperature
    # Apply top-k filtering
    if top_k > 0:
        indices_to_remove = np.argsort(logits)[:-top_k]
        logits[indices_to_remove] = -float('inf')
    # Apply top-p filtering (nucleus sampling)
    if top_p < 1.0:
        sorted_indices = np.argsort(logits)[::-1]
        cumulative_probs = np.cumsum(tf.nn.softmax(logits[sorted_indices]).numpy())
        sorted_indices_to_remove = cumulative_probs > top_p
        # Shift right so the first token that crosses the threshold is kept
        sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].copy()
        sorted_indices_to_remove[0] = False
        indices_to_remove = sorted_indices[sorted_indices_to_remove]
        logits[indices_to_remove] = -float('inf')
    # Sample from the filtered distribution
    probs = tf.nn.softmax(logits).numpy()
    next_token = np.random.choice(len(probs), p=probs)
    # Append the word for this token (index 0 is padding and has no word)
    word = index_to_word.get(next_token)
    if word is not None:
        generated_text += ' ' + word
    # Update sequence: drop the oldest token, append the new one
    current_sequence = np.array([current_sequence[0, 1:].tolist() + [next_token]])
print(generated_text)
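The temperature, top-k, and top-p steps in the loop above can also be factored into functions that are testable without loading the model. This is a sketch in plain NumPy, assuming the input is a 1-D array of unnormalized scores; `softmax`, `filter_scores`, and `sample_next_token` are hypothetical names and are not part of the leaky_model codebase.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()


def filter_scores(scores, temperature=1.0, top_k=0, top_p=1.0):
    """Return probabilities after temperature scaling, top-k, and top-p filtering."""
    # Division creates a new array, so the caller's input is not modified.
    scores = np.asarray(scores, dtype=np.float64) / temperature
    if top_k > 0:
        # Mask everything except the k highest-scoring tokens.
        scores[np.argsort(scores)[:-top_k]] = -np.inf
    if top_p < 1.0:
        order = np.argsort(scores)[::-1]  # best token first
        cumulative = np.cumsum(softmax(scores[order]))
        remove = cumulative > top_p
        # Shift right so the token that crosses the threshold is kept.
        remove[1:] = remove[:-1].copy()
        remove[0] = False
        scores[order[remove]] = -np.inf
    return softmax(scores)


def sample_next_token(scores, rng, **kwargs):
    """Draw one token index from the filtered distribution."""
    probs = filter_scores(scores, **kwargs)
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng(0)
scores = [2.0, 1.0, 0.1, -1.0]  # made-up scores for a 4-token vocabulary
print(filter_scores(scores, top_k=2))
print(sample_next_token(scores, rng, temperature=0.7, top_k=3, top_p=0.9))
```

Masked tokens receive probability exactly 0, so they can never be sampled. Setting `top_k=1` reduces sampling to greedy decoding.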