# Harvard USPTO Dataset Training

## Preprocessing USPTO Data

### Importing the Dataset

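We first install the Hugging Face `datasets` library, which we use to download the USPTO dataset, and then import the packages used throughout the notebook.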

In [None]:
!pip install datasets


In [None]:
from datasets import load_dataset
import pandas as pd
import numpy as np
import os
import json
import torch
import sys

### Loading the Dataset

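We load the `sample` configuration of the dataset, which covers patent applications filed in January 2016. Filings from January 1 to January 21 form the training split, and filings from January 22 to January 31 form the validation split.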

In [None]:
dataset_dict = load_dataset('HUPD/hupd',
    name='sample',
    data_files="https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather", 
    icpr_label=None,
    train_filing_start_date='2016-01-01',
    train_filing_end_date='2016-01-21',
    val_filing_start_date='2016-01-22',
    val_filing_end_date='2016-01-31',
)

Downloading and preparing dataset hupd/sample to /root/.cache/huggingface/datasets/HUPD___hupd/sample-a4eeba92b4229e93/0.0.0/6920d2def8fd7767046c0470603357f76866e5a09c97e19571896bfdca521142...
Loading dataset with config: PatentsConfig(name='sample', version=0.0.0, data_dir='sample', data_files={'train': ['https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather']}, description='Patent data from January 2016, for debugging')
Using metadata file: /root/.cache/huggingface/datasets/downloads/bac34b767c2799633010fa78ecd401d2eeffd62eff58abdb4db75829f8932710
Reading metadata file: /root/.cache/huggingface/datasets/downloads/bac34b767c2799633010fa78ecd401d2eeffd62eff58abdb4db75829f8932710
Filtering train dataset by filing start date: 2016-01-01
Filtering train dataset by filing end date: 2016-01-21
Filtering val dataset by filing start date: 2016-01-22
Filtering val dataset by filing end date: 2016-01-31
Dataset hupd downloaded and prepared to /root/.cache/huggingface/datasets/HUPD___hupd/sample-a4eeba92b4229e93/0.0.0/6920d2def8fd7767046c0470603357f76866e5a09c97e19571896bfdca521142. Subsequent calls will reuse this data.

We print the dataset to see which splits and features are available.

In [None]:
print(dataset_dict)

DatasetDict({
    train: Dataset({
        features: ['patent_number', 'decision', 'title', 'abstract', 'claims', 'background', 'summary', 'description', 'cpc_label', 'ipc_label', 'filing_date', 'patent_issue_date', 'date_published', 'examiner_id'],
        num_rows: 16153
    })
    validation: Dataset({
        features: ['patent_number', 'decision', 'title', 'abstract', 'claims', 'background', 'summary', 'description', 'cpc_label', 'ipc_label', 'filing_date', 'patent_issue_date', 'date_published', 'examiner_id'],
        num_rows: 9094
    })
})
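
The train/validation boundary set in the `load_dataset` call can be pictured as a simple date-range check. The sketch below is only an illustration of that split: the helper `assign_split` is hypothetical, and treating both ends of each window as inclusive is an assumption, since the loader's internal filtering code is not shown here.

```python
from datetime import date

# Date windows copied from the load_dataset arguments above.
TRAIN_WINDOW = (date(2016, 1, 1), date(2016, 1, 21))
VAL_WINDOW = (date(2016, 1, 22), date(2016, 1, 31))

def assign_split(filing_date):
    """Return 'train', 'validation', or None (assumes inclusive bounds)."""
    if TRAIN_WINDOW[0] <= filing_date <= TRAIN_WINDOW[1]:
        return "train"
    if VAL_WINDOW[0] <= filing_date <= VAL_WINDOW[1]:
        return "validation"
    return None  # outside both windows, so not part of either split

print(assign_split(date(2016, 1, 15)))  # train
print(assign_split(date(2016, 1, 25)))  # validation
print(assign_split(date(2016, 2, 3)))   # None
```

Splitting by filing date rather than at random keeps the validation filings strictly later than the training filings.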


We convert the training and validation splits into separate pandas DataFrames.

In [None]:
df_train = pd.DataFrame(dataset_dict['train'])
df_val = pd.DataFrame(dataset_dict['validation'])

### Pre-Processing the Data

We are interested in the following columns:
- Patent Number <- purely for documentation purposes
- Abstract
- Claims
- Decision <- our `y`

We keep only these columns in both the training and validation data.

Also, the `decision` column takes values such as "ACCEPTED", "REJECTED", and "PENDING". Only decided applications are useful as labels, so we keep only the rows marked "ACCEPTED" or "REJECTED".

In [None]:
necessary_columns = ["patent_number","abstract","claims","decision"]
output_values = ['ACCEPTED','REJECTED'] 
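
The filtering described above can be sketched on plain Python dictionaries, so the logic is visible without downloading anything. The rows below are hypothetical toy data, and the helper `keep` is illustrative only; the notebook itself does this with pandas.

```python
necessary_columns = ["patent_number", "abstract", "claims", "decision"]
output_values = ["ACCEPTED", "REJECTED"]

# Hypothetical toy rows standing in for the real dataset.
rows = [
    {"patent_number": "1", "decision": "ACCEPTED", "title": "t1", "abstract": "a1", "claims": "c1"},
    {"patent_number": "2", "decision": "PENDING", "title": "t2", "abstract": "a2", "claims": "c2"},
    {"patent_number": "3", "decision": "REJECTED", "title": "t3", "abstract": None, "claims": "c3"},
    {"patent_number": "4", "decision": "REJECTED", "title": "t4", "abstract": "a4", "claims": "c4"},
]

def keep(row):
    # Drop rows with any missing value, then keep only decided applications.
    has_no_missing = all(value is not None for value in row.values())
    return has_no_missing and row["decision"] in output_values

# Keep the surviving rows, restricted to the columns we care about.
filtered = [{col: row[col] for col in necessary_columns} for row in rows if keep(row)]
print([row["patent_number"] for row in filtered])  # ['1', '4']
```

Row 2 is dropped because it is still pending, and row 3 is dropped because its abstract is missing.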

In [None]:
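# Columns that are not needed for training
trainFeaturesToDrop = [col for col in df_train.columns if col not in necessary_columns]
# Drop rows with missing values, then the unneeded columns
# (chaining returns a new frame instead of mutating a possible copy in place)
trainDF = df_train.dropna().drop(columns=trainFeaturesToDrop)
# Keep only decided applications
trainDF = trainDF[trainDF['decision'].isin(output_values)]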

In [None]:
trainDF

Unnamed: 0,patent_number,decision,abstract,claims
0,13261748,ACCEPTED,The present invention relates to passive optic...,"1. A compact optical network terminal, compris..."
1,13995128,ACCEPTED,Embodiments of the invention provide a method ...,1. A method comprising: using a first reader t...
3,14348792,ACCEPTED,A crystal growth furnace comprising a crucible...,1. A crystal growth furnace for growing a crys...
4,14360978,REJECTED,A shoe midsole is composed of a base plate (1)...,1. A sole member of footwear comprising a base...
5,14369795,ACCEPTED,"A ratchet tool includes a shaft member, a hand...","1. A ratchet tool, comprising a shaft member, ..."
...,...,...,...,...
16144,15002390,ACCEPTED,"A wavelength tunable laser device, including: ...","1. A wavelength tunable laser device, comprisi..."
16145,15002391,ACCEPTED,"In one aspect, a method for use in preparing a...","1. (canceled) 2. The method of claim 19, where..."
16148,15002394,ACCEPTED,A robot hand controlling method executes calcu...,"1. A controlling method of a robot hand, the r..."
16149,15002396,REJECTED,A fusion protein is disclosed. The fusion prot...,1. A fusion protein comprising an Fc fragment ...


In [None]:
# Same preprocessing as for the training data
valFeaturesToDrop = [col for col in df_val.columns if col not in necessary_columns]
valDF = df_val.dropna().drop(columns=valFeaturesToDrop)
valDF = valDF[valDF['decision'].isin(output_values)]

In [None]:
valDF

Unnamed: 0,patent_number,decision,abstract,claims
0,13144833,REJECTED,Regimen for the treatment of rosacea include t...,1. A treatment regimen comprising: cleansing a...
1,14006524,ACCEPTED,A clamp arrangement includes a pair of bracket...,1. A clamp arrangement for supporting a fractu...
2,14365653,REJECTED,A system and method for device action and conf...,1-20. (canceled) 21. A mobile device comprisin...
4,14396367,REJECTED,Systems and methods for managing datasets prod...,"1. A method, comprising: executing, by one or ..."
9,14416282,ACCEPTED,A scan driving circuit is provided. The scan d...,1. A scan driving circuit for driving a scan l...
...,...,...,...,...
9085,15011551,REJECTED,The non-rigid gate device as described may be ...,1; A non-rigid blocking apparatus referred to ...
9090,15011556,REJECTED,The present invention provides an improved unc...,1. A method for rendering a plastic surface am...
9091,15011557,ACCEPTED,A method for detecting a software-race conditi...,1. A method for detecting a software-race cond...
9092,15011558,ACCEPTED,The present application relates to multi-stage...,1. A multi-stage amplitude modulation-based me...


We need to replace the values in the `decision` column with numerical representations. We will encode "ACCEPTED" as `1` and "REJECTED" as `0`.

In [None]:
yKey = {"ACCEPTED":1,"REJECTED":0}

In [None]:
trainDF2 = trainDF.replace({"decision": yKey})
valDF2 = valDF.replace({"decision": yKey})
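
Because `replace` leaves any value that is not in the mapping untouched, it can be worth confirming that every decision is covered before training. The following is a minimal sketch on a hypothetical list of decisions; the `decisions` list is toy data, and only `yKey` comes from the notebook.

```python
from collections import Counter

yKey = {"ACCEPTED": 1, "REJECTED": 0}

# Hypothetical decisions; in the notebook these would come from trainDF["decision"].
decisions = ["ACCEPTED", "REJECTED", "ACCEPTED", "ACCEPTED"]

# Fail loudly if a decision value has no numeric label.
unknown = sorted(set(decisions) - set(yKey))
if unknown:
    raise ValueError(f"Unmapped decision values: {unknown}")

labels = [yKey[decision] for decision in decisions]
counts = Counter(labels)

print(labels)                # [1, 0, 1, 1]
print(counts[1], counts[0])  # 3 1
```

Counting the labels also gives a quick view of class balance, which matters when interpreting accuracy later.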

In [None]:
trainDF2

Unnamed: 0,patent_number,decision,abstract,claims
0,13261748,1,The present invention relates to passive optic...,"1. A compact optical network terminal, compris..."
1,13995128,1,Embodiments of the invention provide a method ...,1. A method comprising: using a first reader t...
3,14348792,1,A crystal growth furnace comprising a crucible...,1. A crystal growth furnace for growing a crys...
4,14360978,0,A shoe midsole is composed of a base plate (1)...,1. A sole member of footwear comprising a base...
5,14369795,1,"A ratchet tool includes a shaft member, a hand...","1. A ratchet tool, comprising a shaft member, ..."
...,...,...,...,...
16144,15002390,1,"A wavelength tunable laser device, including: ...","1. A wavelength tunable laser device, comprisi..."
16145,15002391,1,"In one aspect, a method for use in preparing a...","1. (canceled) 2. The method of claim 19, where..."
16148,15002394,1,A robot hand controlling method executes calcu...,"1. A controlling method of a robot hand, the r..."
16149,15002396,0,A fusion protein is disclosed. The fusion prot...,1. A fusion protein comprising an Fc fragment ...


In [None]:
valDF2

Unnamed: 0,patent_number,decision,abstract,claims
0,13144833,0,Regimen for the treatment of rosacea include t...,1. A treatment regimen comprising: cleansing a...
1,14006524,1,A clamp arrangement includes a pair of bracket...,1. A clamp arrangement for supporting a fractu...
2,14365653,0,A system and method for device action and conf...,1-20. (canceled) 21. A mobile device comprisin...
4,14396367,0,Systems and methods for managing datasets prod...,"1. A method, comprising: executing, by one or ..."
9,14416282,1,A scan driving circuit is provided. The scan d...,1. A scan driving circuit for driving a scan l...
...,...,...,...,...
9085,15011551,0,The non-rigid gate device as described may be ...,1; A non-rigid blocking apparatus referred to ...
9090,15011556,0,The present invention provides an improved unc...,1. A method for rendering a plastic surface am...
9091,15011557,1,A method for detecting a software-race conditi...,1. A method for detecting a software-race cond...
9092,15011558,1,The present application relates to multi-stage...,1. A multi-stage amplitude modulation-based me...


We rename the `decision` column to `label`.

In [None]:
trainDF3 = trainDF2.rename(columns={'decision': 'label'})
trainDF3

Unnamed: 0,patent_number,label,abstract,claims
0,13261748,1,The present invention relates to passive optic...,"1. A compact optical network terminal, compris..."
1,13995128,1,Embodiments of the invention provide a method ...,1. A method comprising: using a first reader t...
3,14348792,1,A crystal growth furnace comprising a crucible...,1. A crystal growth furnace for growing a crys...
4,14360978,0,A shoe midsole is composed of a base plate (1)...,1. A sole member of footwear comprising a base...
5,14369795,1,"A ratchet tool includes a shaft member, a hand...","1. A ratchet tool, comprising a shaft member, ..."
...,...,...,...,...
16144,15002390,1,"A wavelength tunable laser device, including: ...","1. A wavelength tunable laser device, comprisi..."
16145,15002391,1,"In one aspect, a method for use in preparing a...","1. (canceled) 2. The method of claim 19, where..."
16148,15002394,1,A robot hand controlling method executes calcu...,"1. A controlling method of a robot hand, the r..."
16149,15002396,0,A fusion protein is disclosed. The fusion prot...,1. A fusion protein comprising an Fc fragment ...


In [None]:
valDF3 = valDF2.rename(columns={'decision': 'label'})
valDF3

Unnamed: 0,patent_number,label,abstract,claims
0,13144833,0,Regimen for the treatment of rosacea include t...,1. A treatment regimen comprising: cleansing a...
1,14006524,1,A clamp arrangement includes a pair of bracket...,1. A clamp arrangement for supporting a fractu...
2,14365653,0,A system and method for device action and conf...,1-20. (canceled) 21. A mobile device comprisin...
4,14396367,0,Systems and methods for managing datasets prod...,"1. A method, comprising: executing, by one or ..."
9,14416282,1,A scan driving circuit is provided. The scan d...,1. A scan driving circuit for driving a scan l...
...,...,...,...,...
9085,15011551,0,The non-rigid gate device as described may be ...,1; A non-rigid blocking apparatus referred to ...
9090,15011556,0,The present invention provides an improved unc...,1. A method for rendering a plastic surface am...
9091,15011557,1,A method for detecting a software-race conditi...,1. A method for detecting a software-race cond...
9092,15011558,1,The present application relates to multi-stage...,1. A multi-stage amplitude modulation-based me...


We extract each column as a list, giving us patent numbers, labels, abstracts, and claims for both the training and validation sets.

In [None]:
trainData = {
  "patent_numbers":trainDF3["patent_number"].tolist(),
  "labels":trainDF3["label"].tolist(),
  "abstracts":trainDF3["abstract"].tolist(),
  "claims":trainDF3["claims"].tolist(),
}
valData = {
  "patent_numbers":valDF3["patent_number"].tolist(),
  "labels":valDF3["label"].tolist(),
  "abstracts":valDF3["abstract"].tolist(),
  "claims":valDF3["claims"].tolist(),
}

We save these dictionaries as JSON files so they can be reloaded later.

In [None]:
# Create the output directory if it does not exist yet
os.makedirs("./data", exist_ok=True)

with open("./data/train.json", "w") as outfile:
  json.dump(trainData, outfile, indent=2)
with open("./data/val.json", "w") as outfile:
  json.dump(valData, outfile, indent=2)
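
As a sanity check on this step, the sketch below writes a small dictionary of the same shape to a temporary directory, reads it back, and confirms that all lists stay aligned. The contents of `trainData` here are hypothetical toy values, and the temporary path stands in for `./data/train.json`.

```python
import json
import os
import tempfile

# Hypothetical miniature version of the trainData dictionary.
trainData = {
    "patent_numbers": ["1", "2"],
    "labels": [1, 0],
    "abstracts": ["first abstract", "second abstract"],
    "claims": ["1. A method ...", "1. A device ..."],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.json")
    with open(path, "w") as outfile:
        json.dump(trainData, outfile, indent=2)
    with open(path) as infile:
        restored = json.load(infile)

# Every list must have one entry per patent, otherwise labels and texts are misaligned.
lengths = {key: len(values) for key, values in restored.items()}
assert len(set(lengths.values())) == 1, lengths

print(restored == trainData)  # True
```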

## Loading the Trainer

Now we can start training! This time, we will just go with `distilbert-base-uncased` for simplicity.

### Initializing Classes and Trainers

In [None]:
!pip install torch
!pip install transformers


In [None]:
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
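from transformers import Trainer, TrainingArguments
# The AdamW shipped with transformers is deprecated, so we use PyTorch's instead.
# Note that torch.optim.AdamW defaults to weight_decay=0.01 rather than 0.0.
from torch.optim import AdamW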

In [39]:
torch.backends.cuda.matmul.allow_tf32 = True
model_name = "distilbert-base-uncased"
upsto_abstracts_model_path = './models/uspto_abstracts'
upsto_claims_model_path = './models/uspto_claims'

We create a `Dataset` class that wraps the tokenizer encodings and the labels for training.

In [None]:
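class USPTODataset(Dataset):
    """Pairs tokenizer encodings with their labels for use in a DataLoader."""
    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict-like, e.g. 'input_ids' and 'attention_mask'
        self.labels = labels        # list of 0/1 labels, one per example
    def __getitem__(self, idx):
        # Convert the idx-th entry of every encoding field to a tensor
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)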


### Double-Checking the Data

As a basic check, we make sure `trainData` and `valData` are available in memory. If they are not (for example, after a runtime restart), we load them from the saved JSON files.

In [None]:
trainDataPath = "./data/train.json"
valDataPath = "./data/val.json"

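# globals().get() avoids a NameError when the variables were never defined in this session
if globals().get("trainData") is None and os.path.exists(trainDataPath):
  with open(trainDataPath) as f:
    trainData = json.load(f)
if globals().get("valData") is None and os.path.exists(valDataPath):
  with open(valDataPath) as f:
    valData = json.load(f)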

### Encoding the Data

In [None]:
# Initializing the Tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)

In [None]:
# Encoding the Data
train_abstracts_encodings = tokenizer(trainData["abstracts"], truncation=True, padding=True)
train_claims_encodings = tokenizer(trainData["claims"], truncation=True, padding=True)
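
With `truncation=True`, sequences longer than the model's maximum length are cut, and with `padding=True`, shorter sequences are padded up to the longest one in the batch, with an attention mask marking which positions are real tokens. The sketch below illustrates that idea on hypothetical token ID lists. It is a simplification, not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, and unlike the real tokenizer it does not preserve the trailing special token when truncating.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest remaining one."""
    longest = min(max(len(seq) for seq in sequences), max_length)
    input_ids, attention_mask = [], []
    for seq in sequences:
        kept = seq[:max_length]        # truncation
        padding = longest - len(kept)  # number of pad positions needed
        input_ids.append(kept + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(kept) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token IDs for three texts of different lengths.
batch = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_length=5,
)
print(batch["input_ids"])
# [[101, 7, 8, 102, 0], [101, 9, 102, 0, 0], [101, 1, 2, 3, 4]]
print(batch["attention_mask"])
# [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Patent claims are often much longer than abstracts, so truncation is likely to discard more of the claims text than of the abstract text.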

In [None]:
# Creating the Datasets from the data
train_abstracts_dataset = USPTODataset(train_abstracts_encodings, trainData["labels"])
train_claims_dataset = USPTODataset(train_claims_encodings, trainData["labels"])

### Model Preparation

Next, we initialize the pretrained model that we will fine-tune.

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = DistilBertForSequenceClassification.from_pretrained(model_name)
model.to(device)
model.train()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.w

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

### Training Preparation

In [None]:
train_abstracts_loader = DataLoader(train_abstracts_dataset, batch_size=32, shuffle=True)
train_claims_loader = DataLoader(train_claims_dataset, batch_size=32, shuffle=True)
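
To get a feel for what `batch_size=32` and `shuffle=True` imply, the sketch below mimics the batching in plain Python. The function `iterate_batches` and the example count of 100 are hypothetical; the real `DataLoader` also collates tensors, which is omitted here.

```python
import math
import random

def iterate_batches(n_examples, batch_size=32, shuffle=True, seed=0):
    """Yield lists of example indices, one list per batch."""
    indices = list(range(n_examples))
    if shuffle:
        # Visit the examples in a random order
        random.Random(seed).shuffle(indices)
    for start in range(0, n_examples, batch_size):
        yield indices[start:start + batch_size]

n_examples = 100  # hypothetical dataset size
batches = list(iterate_batches(n_examples))

print(len(batches), math.ceil(n_examples / 32))  # 4 4
print(len(batches[-1]))                          # 4
```

The last batch is smaller than the others whenever the dataset size is not a multiple of the batch size, which is also the default behavior of `DataLoader`.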

In [None]:
optim = AdamW(model.parameters(), lr=5e-5)

### Training!

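We train for 10 epochs on each input type. Note that `Train` uses the global `model` and `optim`, so the second run (claims) continues from the weights produced by the first run (abstracts) rather than starting again from the pretrained checkpoint.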

In [None]:
def Train(loader, save_path, num_train_epochs=2):
  """Fine-tune the global `model` on `loader`, saving to `save_path` after every epoch."""
  batch_num = len(loader)
  for epoch in range(num_train_epochs):
    print(f'\t- Training epoch {epoch+1}/{num_train_epochs}')
    for batch_count, batch in enumerate(loader):
      # In-place progress indicator, cleared by the "\r" at the end of the step
      print(f'{batch_count}|{batch_num} - {round((batch_count/batch_num)*100)}%', end="")
      optim.zero_grad()
      # Move the batch to the same device as the model
      input_ids = batch['input_ids'].to(device)
      attention_mask = batch['attention_mask'].to(device)
      labels = batch['labels'].to(device)
      # Passing `labels` makes the model return the classification loss
      outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
      loss = outputs.loss
      loss.backward()
      optim.step()
      print("\r", end="")

    # Save at the end of every epoch (each save overwrites the previous one)
    model.save_pretrained(save_path)
    print(f'Saved model in {save_path}!\n')

In [None]:
print("=== TRAINING ABSTRACTS ===")
Train(train_abstracts_loader,upsto_abstracts_model_path, num_train_epochs=10)
print("----")
print("=== TRAINING CLAIMS ===")
Train(train_claims_loader,upsto_claims_model_path, num_train_epochs=10)

=== TRAINING ABSTRACTS ===
	- Training epoch 1/10
Saved model in ./models/upsto_abstracts!

	- Training epoch 2/10
Saved model in ./models/upsto_abstracts!

	- Training epoch 3/10
Saved model in ./models/upsto_abstracts!

	- Training epoch 4/10
Saved model in ./models/upsto_abstracts!

	- Training epoch 5/10
Saved model in ./models/upsto_abstracts!

	- Training epoch 6/10
Saved model in ./models/upsto_abstracts!

	- Training epoch 7/10
Saved model in ./models/upsto_abstracts!

	- Training epoch 8/10
Saved model in ./models/upsto_abstracts!

	- Training epoch 9/10
Saved model in ./models/upsto_abstracts!

	- Training epoch 10/10
Saved model in ./models/upsto_abstracts!

----
=== TRAINING CLAIMS ===
	- Training epoch 1/10
Saved model in ./models/upsto_claims!

	- Training epoch 2/10
Saved model in ./models/upsto_claims!

	- Training epoch 3/10
Saved model in ./models/upsto_claims!

	- Training epoch 4/10
Saved model in ./models/upsto_claims!

	- Training epoch 5/10
Saved model in ./model

In [40]:
import shutil
shutil.make_archive("uspto_abstracts", 'zip', './models/uspto_abstracts')
shutil.make_archive("uspto_claims", 'zip', './models/uspto_claims')


'/content/uspto_claims.zip'
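
`shutil.make_archive` returns the path of the archive it writes, which is why the cell above echoes a `.zip` path. The sketch below shows the same call on a temporary directory. The directory layout and the `config.json` placeholder file are hypothetical stand-ins for a saved model folder.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for ./models/uspto_abstracts
    model_dir = os.path.join(tmp, "models", "uspto_abstracts")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as f:
        f.write("{}")

    # base_name has no extension; make_archive appends ".zip" itself
    archive_path = shutil.make_archive(os.path.join(tmp, "uspto_abstracts"), "zip", model_dir)

    # Inspect the archive before the temporary directory is removed
    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()
    archive_name = os.path.basename(archive_path)

print(archive_name)             # uspto_abstracts.zip
print("config.json" in names)   # True
```

Paths inside the archive are relative to the directory passed as the third argument, so unzipping yields the model files directly rather than a nested `models/` folder.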