Leaky Model

This is a simple LSTM-based text generation model, designed to illustrate how models can leak sensitive data.

The training data consists of penetration testing reports (in PDF format) taken from prior competition events. The original source files are available in the CPTC Report Examples repository.
The code used to process the data and train the model is in the CPTC leaky_model repository.

The model is distributed as the following files:

text_generation_model.keras: The trained LSTM (Long Short-Term Memory) neural network, saved in Keras format.
text_processor.pkl: A pickled (serialized) TextProcessor object containing:
- A tokenizer fitted on the vocabulary of the training data
- The sequence length configuration (default: 50 tokens)
- The vocabulary size
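
As a rough illustration of how the tokenizer vocabulary and the sequence length work together, the sketch below encodes a prompt into a fixed-length list of token indices and decodes indices back into words. It uses only the standard library. The `word_index` dictionary, the helper names `encode_prompt` and `decode_tokens`, and the sequence length of 5 are hypothetical stand-ins, not the contents of text_processor.pkl; dropping unknown words is a simplifying assumption.

```python
# Hypothetical vocabulary for illustration; the real one lives in the fitted tokenizer.
word_index = {"the": 1, "password": 2, "is": 3, "admin": 4}
index_word = {index: word for word, index in word_index.items()}


def encode_prompt(prompt, word_index, sequence_length):
    """Map words to indices, keep the last sequence_length tokens, left-pad with 0."""
    # Assumption: words missing from the vocabulary are simply dropped.
    tokens = [word_index[w] for w in prompt.lower().split() if w in word_index]
    tokens = tokens[-sequence_length:]
    return [0] * (sequence_length - len(tokens)) + tokens


def decode_tokens(tokens, index_word):
    """Map indices back to words, skipping padding (0) and unknown indices."""
    return " ".join(index_word[t] for t in tokens if t in index_word)


encoded = encode_prompt("The password is", word_index, sequence_length=5)
print(encoded)                             # [0, 0, 1, 2, 3]
print(decode_tokens(encoded, index_word))  # the password is
```

Index 0 is reserved for padding here, which is why it never maps back to a word.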

Usage

import tensorflow as tf
import pickle
import numpy as np

model_file = "text_generation_model.keras"
processor_file = "text_processor.pkl"

# Load model and processor
model = tf.keras.models.load_model(model_file)
with open(processor_file, 'rb') as f:
    processor = pickle.load(f)

# Generation parameters
prompt = "Once upon a time"
max_tokens = 100
temperature = 1.7    # Higher = more random, Lower = more focused (default: 0.7)
top_k = 50          # Limit to top k tokens (set to 0 to disable)
top_p = 0.9         # Nucleus sampling threshold (set to 1.0 to disable)

# Process the prompt
tokenizer = processor['tokenizer']
sequence_length = processor['sequence_length']
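current_sequence = tokenizer.texts_to_sequences([prompt])[0]
# Keep at most the last sequence_length tokens, then left-pad with zeros
current_sequence = current_sequence[-sequence_length:]
current_sequence = [0] * (sequence_length - len(current_sequence)) + current_sequence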
current_sequence = np.array([current_sequence])

# Generate text
generated_text = prompt
for _ in range(max_tokens):
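    pred = model.predict(current_sequence, verbose=0)
    # Treats the model output as logits; if the final layer applies softmax,
    # use np.log(pred[0] + 1e-9) / temperature instead
    logits = pred[0] / temperature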

    # Apply top-k filtering
    if top_k > 0:
        indices_to_remove = np.argsort(logits)[:-top_k]
        logits[indices_to_remove] = -float('inf')

    # Apply top-p filtering (nucleus sampling)
    if top_p < 1.0:
        sorted_logits = np.sort(logits)[::-1]
        cumulative_probs = np.cumsum(tf.nn.softmax(sorted_logits))
        sorted_indices_to_remove = cumulative_probs > top_p
        sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1]
        sorted_indices_to_remove[0] = False
        indices_to_remove = np.argsort(logits)[::-1][sorted_indices_to_remove]
        logits[indices_to_remove] = -float('inf')

    # Sample from the filtered distribution
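    probs = tf.nn.softmax(logits).numpy().astype(np.float64)
    probs /= probs.sum()  # renormalize: float32 rounding can make np.random.choice reject the vector
    next_token = np.random.choice(len(probs), p=probs)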

    # Get the word for this token (index_word is the Keras Tokenizer's reverse
    # mapping of word_index; index 0 is padding and has no word)
    word = tokenizer.index_word.get(next_token)
    if word is not None:
        generated_text += ' ' + word

    # Update sequence
    current_sequence = np.array([current_sequence[0, 1:].tolist() + [next_token]])

print(generated_text)
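
The temperature, top-k, and top-p steps in the loop above can also be tried in isolation. The following is a minimal sketch using only the standard library, so it runs without the model files. The function names `softmax` and `filter_distribution` and the example logits are hypothetical and not part of the leaky_model codebase.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature, then top-k, then top-p; return renormalized probabilities."""
    probs = softmax([x / temperature for x in logits])
    # Token indices ordered from most to least likely
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_k > 0:
        order = order[:top_k]
    # Renormalize over the surviving tokens before the nucleus cut
    total = sum(probs[i] for i in order)

    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i] / total
            if cumulative > top_p:  # keep the token that crosses the threshold
                break
        order = kept
        total = sum(probs[i] for i in order)

    filtered = [0.0] * len(probs)
    for i in order:
        filtered[i] = probs[i] / total
    return filtered


example_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
probs = filter_distribution(example_logits, temperature=0.7, top_k=3, top_p=0.9)
next_token = random.choices(range(len(probs)), weights=probs)[0]
print(probs, next_token)
```

With these example values only the two most likely tokens keep a nonzero probability, so `next_token` is always 0 or 1.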