---
license: apache-2.0
---

# Leaky Model

This is a simple LSTM-based text generation model, designed to illustrate how models can leak sensitive data.

* The raw data used to train the model is a collection of penetration testing reports (in PDF format) taken from prior competition events. The original source files are available in the [CPTC Report Examples](https://github.com/globalcptc/report_examples) repository.
* The codebase used to process the data and train this model is in the [CPTC leaky_model](https://github.com/globalcptc/leaky_model) repository.

This model contains the following files:

* **text_generation_model.keras**: the trained LSTM (Long Short-Term Memory) neural network model, saved in Keras format
* **text_processor.pkl**: a pickled (serialized) TextProcessor object containing:
  - a fitted tokenizer with the vocabulary from the training data
  - the sequence length configuration (default 50 tokens)
  - vocabulary size information

## Usage

```python
import tensorflow as tf
import pickle
import numpy as np

model_file = "text_generation_model.keras"
processor_file = "text_processor.pkl"

# Load model and processor
model = tf.keras.models.load_model(model_file)
with open(processor_file, 'rb') as f:
    processor = pickle.load(f)

# Generation parameters
prompt = "Once upon a time"
max_tokens = 100
temperature = 1.7  # Higher = more random, lower = more focused (default: 0.7)
top_k = 50         # Limit sampling to the top k tokens (set to 0 to disable)
top_p = 0.9        # Nucleus sampling threshold (set to 1.0 to disable)

# Process the prompt: tokenize, truncate to the model's window, and left-pad
tokenizer = processor['tokenizer']
sequence_length = processor['sequence_length']
current_sequence = tokenizer.texts_to_sequences([prompt])[0]
current_sequence = current_sequence[-sequence_length:]
current_sequence = [0] * (sequence_length - len(current_sequence)) + current_sequence
current_sequence = np.array([current_sequence])

# Generate text
generated_text = prompt
for _ in range(max_tokens):
    pred = model.predict(current_sequence, verbose=0)
    logits = pred[0] / temperature  # scale the model's output scores by temperature

    # Apply top-k filtering: drop everything except the k highest-scoring tokens
    if top_k > 0:
        indices_to_remove = np.argsort(logits)[:-top_k]
        logits[indices_to_remove] = -float('inf')

    # Apply top-p filtering (nucleus sampling): keep the smallest set of tokens
    # whose cumulative probability exceeds top_p
    if top_p < 1.0:
        sorted_logits = np.sort(logits)[::-1]
        cumulative_probs = np.cumsum(tf.nn.softmax(sorted_logits))
        sorted_indices_to_remove = cumulative_probs > top_p
        # Shift the mask right so the token that crosses the threshold is kept
        sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].copy()
        sorted_indices_to_remove[0] = False
        indices_to_remove = np.argsort(logits)[::-1][sorted_indices_to_remove]
        logits[indices_to_remove] = -float('inf')

    # Sample the next token from the filtered distribution
    probs = tf.nn.softmax(logits).numpy()
    probs = probs / probs.sum()  # renormalize to guard against float rounding
    next_token = np.random.choice(len(probs), p=probs)

    # Look up the word for this token (index 0 is padding and has no word)
    for word, index in tokenizer.word_index.items():
        if index == next_token:
            generated_text += ' ' + word
            break

    # Slide the window: drop the oldest token and append the new one
    current_sequence = np.array([current_sequence[0, 1:].tolist() + [next_token]])

print(generated_text)
```
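
Before generating, it can be useful to confirm what the pickled processor actually exposes. The sketch below is a minimal example, assuming the pickle provides the same `tokenizer` and `sequence_length` entries used in the usage script above; the file name is the one shipped with this model.

```python
import pickle

# Inspect the pickled text processor (assumes the 'tokenizer' and
# 'sequence_length' entries used in the usage script above)
with open("text_processor.pkl", "rb") as f:
    processor = pickle.load(f)

tokenizer = processor['tokenizer']
print("sequence length:", processor['sequence_length'])
print("vocabulary size:", len(tokenizer.word_index) + 1)  # +1 for the reserved padding index 0
print("sample of vocabulary:", list(tokenizer.word_index)[:10])
```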