Spaces:
Running
Running
File size: 2,589 Bytes
ce58055 64bf0d1 24a763d bc07fd4 4cc0cc9 bc07fd4 96d07c0 bc07fd4 d7ae984 bc07fd4 ece4a71 bc07fd4 fc5b4b2 d8e876f fc5b4b2 d8e876f bc07fd4 e36d818 c3dd1c6 29a87d5 c3dd1c6 fc5b4b2 d8e876f 53118ec 33c0c43 d8e876f fc5b4b2 d8e876f 53118ec d8e876f e36d818 e4efde8 03a5613 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 |
---
title: Can I Patent This
emoji: π
colorFrom: gray
colorTo: purple
sdk: streamlit
sdk_version: 1.21.0
app_file: app.py
pinned: false
---
# CS 670 Project - Finetuning Language Models
************************
Milestone-3 notebook: https://github.com/aye-thuzar/CS670Project/blob/main/CS670_milestone_3_AyeThuzar.ipynb
Hugging Face App: https://huggingface.co/spaces/ayethuzar/can-i-patent-this
Landing Page for the App: https://sites.google.com/view/cs670-finetuning-language-mode/home
App Demonstration Video:
************************
## Summary
***********
**milestone1:** https://github.com/aye-thuzar/CS670Project/blob/main/README_milestone_1.md
**milestone2:** https://github.com/aye-thuzar/CS670Project/blob/main/README_milestone-2.md
Dataset: https://github.com/suzgunmirac/hupd
**Data Preprocessing**
I used the load_dataset function to load all the patent applications that were filed to the USPTO in January 2016. We specify the date ranges of the training and validation sets as January 1-21, 2016 and January 22-31, 2016, respectively. This is a smaller dataset.
There are two datasets: train and validation. Here are the steps I did:
- Label-to-index mapping for the decision status field
- map the 'abstract' and 'claims' sections and tokenize them using pretrained('distilbert-base-uncased') tokenizer
- format them
- use DataLoader with batch_size = 16
**milestone3:**
The following notebook has the tuned model. There are 6 classes in the Harvard USPTO patent dataset and I decided to encode them as follow:
decision_to_str = {'REJECTED': 0, 'ACCEPTED': 1, 'PENDING': 1, 'CONT-REJECTED': 0, 'CONT-ACCEPTED': 1, 'CONT-PENDING': 1}
so that I can get a patentability score between 0 and 1.
I use the pertained-model 'distilbert-base-uncased' from the Hugging face hub and tune it with the smaller dataset.
The average accuracy of the validation set is about 89%.
milestone3 notebook: https://github.com/aye-thuzar/CS670Project/blob/main/CS670_milestone_3_AyeThuzar.ipynb
**milestone4:**
Please see Milestone4Documentation.md:
Here is the landing page for my app: https://sites.google.com/view/cs670-finetuning-language-mode/home
**************
References:
1. https://colab.research.google.com/drive/1_ZsI7WFTsEO0iu_0g3BLTkIkOUqPzCET?usp=sharing#scrollTo=B5wxZNhXdUK6
2. https://huggingface.co/AI-Growth-Lab/PatentSBERTa
3. https://huggingface.co/anferico/bert-for-patents
4. https://huggingface.co/transformers/v3.2.0/custom_datasets.html
5. https://colab.research.google.com/drive/1TzDDCDt368cUErH86Zc_P2aw9bXaaZy1?usp=sharing
|