CryptGPT: Privacy-Preserving Language Models Using Vigenere Cipher (Part 1)

Community Article · Published June 16, 2024

tl;dr: we pretrained a GPT-2 tokenizer and model from scratch on a dataset encrypted with the Vigenère cipher, and it performs as well as regular GPT-2. Except that in order to use it, you need to know the encryption key.

Privacy-preserving here means protecting the data from the model provider. Imagine OpenAI trained a GPT-4 using this scheme and gave the government the key: the government could then use the model safely even while it is hosted on OpenAI's servers, and OpenAI would never need to share the model weights with the government.


Introduction

Language models like GPT-4 are pretty awesome. They can generate text, answer questions, and help with all sorts of tasks. But as they become more popular, people are starting to worry about privacy. How can we make sure the data used to train these models and the stuff they generate stays private?

In this first part of a series on privacy-preserving language models, I dive into a simple approach using the Vigenere cipher. The goal is to see if we can encrypt the training data and still use it without losing performance. In future posts, I'll explore more advanced methods like prefix keys and using ChaCha20 as the cipher.

The Problem

So here's the deal: language models are super useful, but they come with a privacy risk. When you train a model on text data, sometimes that data can be extracted later, which means private information might get exposed. This is a big concern for users who want to keep their data confidential.

The challenge is finding a way to train and use these models without compromising data privacy. Current methods like Secure Multiparty Computation and Homomorphic Encryption are too slow to be practical. We need a solution that actually works and is efficient.

Other Approaches

There are a few methods being explored to address privacy in language models:

  1. Secure multiparty computation: Allows multiple parties to compute a function together while keeping their inputs private. But it's super slow and impractical for real-time use.
  2. Homomorphic encryption: Lets you perform computations on encrypted data without decrypting it first. But it's also extremely computationally intensive and hasn't been widely used for large language models yet.
  3. Differential privacy: Random noise is added to the data to protect individual privacy. It requires a large number of users to provide strong privacy guarantees and still relies on trusting the model provider to implement it correctly.

These methods have drawbacks that make them difficult to use in practice, though. There is no working FHE variant of LLMs, DP doesn't work well unless you have balanced usage, and MPC, the only currently viable option, is painfully slow.

To give you an idea of how slow MPC methods currently are: the fastest MPC implementation of LLaMA-7B takes five minutes per token at generation time, which increases with context size. Yikes!

Idea and Key Insights

The idea is to use encryption to protect the inputs and outputs of language models. Instead of relying on computationally intensive methods, I decided to explore a simple encryption method: the Vigenere cipher.

The Vigenère cipher is one of the simplest (and oldest) ciphers that employ polyalphabetic substitution (each letter can be assigned more than one substitute, depending on its position).

Think Caesar cipher, but better. It was first described in 1553, yet it took three centuries to break, in 1863. Its weakness: once someone determines the key length, the cipher can be broken, and short keys are especially easy to break.

Token Stability and Learning

In language models, text is tokenized into smaller pieces, and these tokens are used for training and generating text. For the model to learn effectively, the encryption method must maintain a stable, one-to-one correspondence between pieces of text and their ciphertext. This stability ensures that the model can learn the encrypted text in the same way it learns unencrypted text.

The Vigenere cipher, with its character-by-character substitution, provides this stability. For example, if the input string "123" is encrypted to "CA9," adding another character to the input, making it "1234," would result in "CA9D" (hypothetically speaking). This predictable and stable transformation allows the model to learn patterns in the encrypted text just as it would in plain text.
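
To make this concrete, here is a minimal, self-contained sketch (a toy version of the modular-addition scheme used in the full implementation below, with an illustrative key) showing that encrypting a longer string only appends to the ciphertext of its prefix:

LEN_UNICODE = 55215  # same alphabet size as the full implementation below

def toy_encrypt(message: str, key: str) -> str:
    # Shift every character by the corresponding (repeating) key character.
    return "".join(
        chr((ord(ch) + ord(key[i % len(key)])) % LEN_UNICODE)
        for i, ch in enumerate(message)
    )

key = "secret"  # illustrative key
short = toy_encrypt("123", key)
longer = toy_encrypt("1234", key)

# The ciphertext of "1234" starts with the ciphertext of "123", so the
# mapping from text pieces to ciphertext pieces stays stable as text grows.
assert longer.startswith(short)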

Trade-off Between Training and Inference Costs

It's important to understand the trade-off between training and inference costs. Methods like HE and MPC are intractable for language models due to their significant computational overhead during inference (test-time). However, we can shift this computational burden to the training phase. By pretraining the model on an encrypted corpus, the model can handle encrypted data without any extra computation at inference time.

CryptGPT currently uses the Vigenere cipher for initial validation. If the model can learn effectively from the Vigenere-encrypted text, it opens the door to using more robust encryption methods, like ChaCha20.

Training CryptGPT

To test whether a language model can learn from encrypted text, I experimented using the Vigenere cipher on GPT-2 architecture.

GPT-2 is a great starting point because it's simple and provides a very solid baseline. GPT-2 comes in several sizes, from Small (124M params) to XL (1.5B). I chose GPT-2 Small and GPT-2 Large for this experiment.

Here's how I implemented the training run.

[Flowchart: encrypt the dataset -> train the tokenizer -> train the model]

Important note: For brevity I trimmed the code snippets below, you can find the entire original code here.

Step 1: Encrypt the Dataset

First, we need to encrypt the dataset using the Vigenère cipher. Unlike typical Vigenère implementations, this one aims to support all Unicode characters. Below is the code I used for encryption and decryption:

len_unicode = 55215  # NOT 65536 because surrogates are not allowed in python

def encrypt(message, key):
    encrypted = ""
    split_message = [
        message[i : i + len(key)] for i in range(0, len(message), len(key))
    ]

    for each_split in split_message:
        i = 0
        for letter in each_split:
            number = (ord(letter) + ord(key[i])) % len_unicode
            encrypted += chr(number)
            i += 1

    return encrypted

def decrypt(cipher, key):
    decrypted = ""
    split_encrypted = [
        cipher[i : i + len(key)] for i in range(0, len(cipher), len(key))
    ]

    for each_split in split_encrypted:
        i = 0
        for letter in each_split:
            number = (ord(letter) - ord(key[i])) % len_unicode
            decrypted += chr(number)
            i += 1

    return decrypted
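
As a quick sanity check (with a made-up key, purely for illustration), the two functions round-trip correctly and preserve prefixes:

key = "my-secret-key"  # hypothetical key, for illustration only

ciphertext = encrypt("hello world", key)
assert decrypt(ciphertext, key) == "hello world"

# Prefix stability: a longer message extends the same ciphertext.
assert encrypt("hello world, again", key).startswith(ciphertext)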

Next, encrypt the dataset and save it to files, because the tokenizer training pipeline expects its input to be split across files. Here's a snippet to illustrate the process:

import os
import multiprocessing

from datasets import load_dataset

# encrypt_ wraps the Vigenere encrypt() above with the key taken from the
# ENCRYPTION_KEY environment variable; gpt2_tokenizer is the stock GPT-2
# tokenizer, used here only for its eos token.
num_proc = multiprocessing.cpu_count()
dataset = load_dataset("openwebtext", num_proc=num_proc)
dataset = dataset.map(
    lambda row: dict(encrypted=encrypt_(row["text"]) + gpt2_tokenizer.eos_token),
    num_proc=num_proc - 1,
)

def combine_and_save(rows, idxs):
    # Write each batch of encrypted rows to its own text file; the tokenizer
    # training step below expects the corpus split across plain-text files.
    idx = idxs[0]
    texts = rows["encrypted"]
    save_dir = "./cryptgpt-data"
    os.makedirs(save_dir, exist_ok=True)
    file = f"{save_dir}/data-{idx}.txt"
    with open(file, "w") as f:
        for text in texts:
            f.write(text)
    return dict(file=[file] * len(texts))

dataset.map(combine_and_save, batched=True, batch_size=1000, with_indices=True, num_proc=num_proc // 2)

Step 2: Train the Tokenizer

Next, train a Byte Pair Encoding tokenizer on the encrypted dataset using the tokenizers Python package. This process took about 3 hours on a machine with 96 vCPUs and 680 GB of RAM.

import glob

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

eos_token = gpt2_tokenizer.eos_token

tokenizer = Tokenizer(BPE(unk_token=eos_token))

# Byte-level pre-tokenization, as in the original GPT-2 tokenizer.
tokenizer.pre_tokenizer = ByteLevel()

tokenizer.enable_truncation(max_length=gpt2_tokenizer.model_max_length)

trainer = BpeTrainer(
    vocab_size=gpt2_tokenizer.vocab_size,
    special_tokens=[eos_token],
)

train_files = glob.glob("./cryptgpt-data/*.txt")

tokenizer.train(train_files, trainer=trainer)

Then we upload it to huggingface hub.
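
The exact upload code is trimmed here; one way to do it (assuming you are logged in to the Hub and reusing the repo name from the config below) is to wrap the trained tokenizer in a PreTrainedTokenizerFast and push it:

from transformers import PreTrainedTokenizerFast

# Wrap the raw tokenizers.Tokenizer so it can later be loaded with AutoTokenizer.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    eos_token=eos_token,
    model_max_length=gpt2_tokenizer.model_max_length,
)
hf_tokenizer.push_to_hub("diwank/cryptgpt-large")  # repo name assumed from the config below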

Step 3: Train the Model

Finally, train the model on the encrypted dataset using Axolotl:

Axolotl Configuration (YAML)
base_model: diwank/cryptgpt-large
hub_model_id: diwank/cryptgpt-large

model_type: GPT2LMHeadModel
tokenizer_type: AutoTokenizer
trust_remote_code: true  # required for CryptGPTTokenizer
output_dir: ./outputs/model-out

datasets:
  - path: diwank/encrypted-openwebtext
    type: completion

sequence_len: 1024
pad_to_sequence_len: true
train_on_inputs: true

gradient_accumulation_steps: 1
micro_batch_size: 128
optimizer: adamw_bnb_8bit
adam_beta1: 0.9
adam_beta2: 0.95
seed: 42

lr_scheduler: cosine
learning_rate: 6e-4
cosine_min_lr_ratio: 0.1  # min: 6e-5
weight_decay: 0.15

bf16: auto

max_steps: 600000

Model Training

I trained two variants of the model: GPT-2 (137M parameters) and GPT-2 Large (812M parameters). The training was conducted on an 8xA100 machine using the Axolotl training framework. Training the GPT-2 model took around 40 hours, while the GPT-2 Large model took approximately 80 hours.

Training Logs and Model Artifacts

You can find the training logs here.

The trained models are available on HuggingFace for further analysis.

The code for this project is on GitHub.

Results

The loss curves for both models indicate that they were able to learn from the encrypted data effectively. Here's the loss curve for the GPT-2 Large model:

[Loss curve for the GPT-2 Large model]

The loss curve shows a steady decline, suggesting that the model can indeed learn meaningful patterns from the Vigenere-encrypted text. Pretty cool, right?

Limitations of This Approach

  1. Model and Tokenizer Tied to the Key

The biggest limitation is that the model and tokenizer are tightly coupled to the encryption key used during training. There is effectively one key per model, and changing the key requires retraining the entire model.

  2. Susceptibility to Frequency Analysis Attacks

The Vigenere cipher, while good enough for initial validation, is vulnerable to frequency analysis attacks. If an attacker has access to ample amounts of ciphertext, they can use the following methods to break the encryption (a rough sketch of the key-length estimation appears after this list):

  • Kasiski Examination / Friedman Test: Used to determine the length of the key, either by analyzing repeated sequences in the ciphertext (Kasiski) or via the index of coincidence (Friedman).

  • Frequency Analysis of Subsequences: Once the key length is known, the letter distributions and relative frequency histograms of the cipher subsequences can be used to deduce the keyword, by comparing the frequency of characters in each subsequence to the expected frequency of characters in the language.

  3. Model Weights Leakage

Even if sufficient ciphertext is not available, an attacker who obtains the model weights can still exploit this weakness. By sampling large amounts of text from the model, starting from random tokens and using a high temperature for generation, they could produce enough text to perform frequency analysis and potentially deduce the encryption key.
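
To illustrate why ample ciphertext is dangerous, here is a rough sketch of the classic key-length estimation (the general idea, not a tuned attack on CryptGPT): split the ciphertext into candidate subsequences and pick the key length whose subsequences look most plaintext-like by index of coincidence.

from collections import Counter

def index_of_coincidence(text: str) -> float:
    # Probability that two randomly chosen characters of `text` are equal.
    counts = Counter(text)
    n = len(text)
    if n < 2:
        return 0.0
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def score_key_length(ciphertext: str, key_len: int) -> float:
    # Each subsequence i, i+key_len, i+2*key_len, ... was shifted by a single
    # key character, so a correct key_len guess gives higher coincidence.
    subsequences = [ciphertext[i::key_len] for i in range(key_len)]
    return sum(index_of_coincidence(s) for s in subsequences) / key_len

def guess_key_length(ciphertext: str, max_len: int = 40) -> int:
    return max(range(1, max_len + 1), key=lambda k: score_key_length(ciphertext, k))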

Addressing Those Limitations

  1. Decoupling the Model and Tokenizer from the Key

One way to address this is to freeze the inner layers of the model and retrain only the outer layers when the key changes: the core understanding of the encrypted language is preserved in the inner layers, while the outer layers adapt to the new key (see the first sketch after this list).

  2. Using a Stronger Encryption Algorithm

To counter the susceptibility to frequency analysis attacks, we can use a stronger encryption algorithm like XChaCha20. However, to maintain token stability, the nonce must be fixed (I know, I know). This introduces some limitations of its own but still substantially strengthens the encryption. The challenge here is to balance the need for token stability with robust encryption (see the second sketch after this list).

  3. Mitigating Model Weights Leakage

To mitigate this risk, we can use a secret prefix key. The prefix key is a specific sequence added to the beginning of each input during training. The model learns to produce meaningful text only when this prefix is present. Without the correct prefix, any text generated by the model will be gibberish. This ensures that even if the model weights are compromised, the attacker cannot generate useful outputs without the prefix key.
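
For the first point, here is a minimal sketch of what "retrain only the outer layers" could look like with Hugging Face Transformers; the choice of which layers to freeze is illustrative, not a tested recipe:

from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("diwank/cryptgpt-large")

# Freeze the transformer blocks, which (hopefully) hold key-independent structure.
for block in model.transformer.h:
    for param in block.parameters():
        param.requires_grad = False
for param in model.transformer.wpe.parameters():
    param.requires_grad = False

# Keep the token embeddings trainable (the LM head is weight-tied to them in
# GPT-2), so they can re-adapt to text encrypted under a new key.
for param in model.transformer.wte.parameters():
    param.requires_grad = True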
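
For the second point, here is a small sketch (using PyCryptodome, where a 24-byte nonce selects XChaCha20) of why a stream cipher with a fixed nonce still provides the prefix stability the tokenizer needs:

from Crypto.Cipher import ChaCha20  # pip install pycryptodome

key = bytes(32)    # placeholder; use a real random 32-byte key in practice
nonce = bytes(24)  # 24-byte nonce selects XChaCha20; fixing it makes encryption deterministic

def xchacha_encrypt(text: str) -> bytes:
    return ChaCha20.new(key=key, nonce=nonce).encrypt(text.encode("utf-8"))

short = xchacha_encrypt("hello")
longer = xchacha_encrypt("hello world")

# Same key and nonce means the same keystream, so ciphertexts share prefixes --
# exactly the nonce reuse that stream ciphers normally forbid, hence the trade-off.
assert longer.startswith(short)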

What's the point?

Privacy! For instance, governments or private corporations could collaborate with organizations like OpenAI to train a model on their behalf using this approach. In this scenario, OpenAI wouldn't need to share the model weights or architecture, and the collaborating party wouldn't have to disclose their raw datasets or inference inputs and outputs.

Working Example

import os
os.environ["ENCRYPTION_KEY"] = "<you cant guess this>"  # must be the same key the model was trained with

from cryptgpt.prepare_dataset import encrypt_, decrypt_
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model = AutoModelForCausalLM.from_pretrained("diwank/cryptgpt-large")
tokenizer = AutoTokenizer.from_pretrained("diwank/cryptgpt-large")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
decrypt_(pipe(encrypt_("hello"))[0]["generated_text"])

Result: 'hello\n\nGood luck to you!\n\nBaby boomers at this years American All-American team, Inc. are not re'

(I love what it generated in this response lol xD )

About Julep AI

This work was conducted at Julep AI, an AI lab dedicated to building infrastructure for AI agents. You can learn more about Julep AI here, and we are always looking for amazing people to join our team! Drop by and say hi on our Discord server.

Summary and Future Directions

In this first part of the series, we demonstrated that a language model can be trained on text encrypted with the Vigenère cipher without losing performance, validating the hypothesis that encryption can be integrated effectively into the training and inference processes.

Future work will focus on:

  • Implementing stronger encryption methods like XChaCha20.
  • Exploring the use of prefix keys to further enhance security.
  • Refining the approach to decouple the model and tokenizer from the encryption key.

Part 2 of this series will delve deeper into these advanced methods and their implementation, aiming to make privacy-preserving language models even more robust and practical.

A Challenge for Cryptanalysts and LLM Researchers

I invite cryptanalysts and LLM researchers to try and find the key used in our experiments. Both the model weights and the tokenizer are open source, and you can access them here. As an added incentive, I offer 50 hours of my time (or an equivalent sum in dollars) to anyone who can successfully break the encryption. Good luck, and happy hacking! (Offer expires July 15th -- one month from this post)

You can find me on X/Twitter and LinkedIn.
