---
|
license: mit |
|
language: |
|
- en |
|
--- |
|
|
|
# Model Card for Pebblo Classifier
|
|
|
This model card describes the Pebblo Classifier, a text classification model developed by DAXA.AI. It categorizes agreement and other organizational documents, and was trained on 20 distinct labels.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
The Pebblo Classifier is a DistilBERT-based model, fine-tuned from distilbert-base-uncased for RAG (Retrieval-Augmented Generation) applications. It classifies text into categories such as "BOARD_MEETING_AGREEMENT," "CONSULTING_AGREEMENT," and others, streamlining document classification processes.
|
|
|
- **Developed by:** DAXA.AI |
|
- **Funded by:** Open Source |
|
- **Model type:** Classification model |
|
- **Language(s) (NLP):** English |
|
- **License:** MIT |
|
- **Finetuned from model:** distilbert-base-uncased |
|
|
|
### Model Sources |
|
|
|
- **Repository:** [https://huggingface.co/daxa-ai/pebblo-classifier](https://huggingface.co/daxa-ai/pebblo-classifier)
|
- **Demo:** [https://huggingface.co/spaces/daxa-ai/Daxa-Classifier](https://huggingface.co/spaces/daxa-ai/Daxa-Classifier) |
|
|
|
## Uses |
|
|
|
### Intended Use |
|
|
|
The model is designed for direct application in document classification, capable of immediate deployment without additional fine-tuning. |
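
For a quick check, the model can also be called through the transformers pipeline API. The snippet below is a minimal sketch and assumes the repository's config maps class indices to readable label names; if it does not, decode predictions with the label encoder shown in "How to Get Started with the Model" below.

```python
# Minimal sketch using the high-level pipeline API. Assumption: the repo's
# config.json carries a readable id2label mapping; otherwise decode with the
# label encoder shown later in this card.
from transformers import pipeline

classifier = pipeline("text-classification", model="daxa-ai/pebblo-classifier")
result = classifier("This mutual non-disclosure agreement is entered into by the parties...")
print(result)  # e.g. [{'label': 'NDA_AGREEMENT', 'score': 0.97}] (illustrative)
```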
|
|
|
### Recommendations |
|
|
|
Users should be aware of the model's potential biases and limitations, such as its fixed 20-label taxonomy and English-only training data, before relying on its predictions.
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
```python
# Import necessary libraries
import joblib
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("daxa-ai/pebblo-classifier")
model = AutoModelForSequenceClassification.from_pretrained("daxa-ai/pebblo-classifier")

# Example text
text = "Please enter your text here."
encoded_input = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded_input)

# Apply softmax to the logits to obtain class probabilities
probabilities = torch.nn.functional.softmax(output.logits, dim=-1)

# Get the index of the predicted label
probabilities = torch.nn.functional.softmax(output.logits, dim=-1)
predicted_label = torch.argmax(probabilities, dim=-1)

# Hugging Face model repository and the label encoder file it hosts
REPO_NAME = "daxa-ai/pebblo-classifier"
LABEL_ENCODER_FILE = "label encoder.joblib"

# Download and cache the label encoder file from the repository
filename = hf_hub_download(repo_id=REPO_NAME, filename=LABEL_ENCODER_FILE)

# Load the label encoder and decode the predicted index into a label string
label_encoder = joblib.load(filename)
decoded_label = label_encoder.inverse_transform(predicted_label.numpy())

print(decoded_label)
```
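
Continuing from the snippet above, the same tokenizer, model, and label encoder can score a batch of documents in one pass; the padding, truncation, and max-length settings below are illustrative assumptions rather than values prescribed by the model card.

```python
# Batched inference, reusing tokenizer, model, and label_encoder from the
# snippet above. padding/truncation/max_length are illustrative choices.
texts = ["First agreement text...", "Second agreement text..."]
batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
labels = label_encoder.inverse_transform(logits.argmax(dim=-1).numpy())
print(labels)
```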
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The training dataset consists of 131,771 entries spanning 20 unique labels and various document types. Instances are distributed across three target text lengths (128, 256, and 512 words, each varying by up to ±20 words); a hypothetical sketch of this bucketing follows the table below.
|
Here are the labels along with their respective counts in the dataset: |
|
|
|
| Agreement Type | Instances | |
|
| --------------------------------------- | --------- | |
|
| BOARD_MEETING_AGREEMENT | 4,225 | |
|
| CONSULTING_AGREEMENT | 2,965 | |
|
| CUSTOMER_LIST_AGREEMENT | 9,000 | |
|
| DISTRIBUTION_PARTNER_AGREEMENT | 8,339 | |
|
| EMPLOYEE_AGREEMENT | 3,921 | |
|
| ENTERPRISE_AGREEMENT | 3,820 | |
|
| ENTERPRISE_LICENSE_AGREEMENT | 9,000 | |
|
| EXECUTIVE_SEVERANCE_AGREEMENT | 9,000 | |
|
| FINANCIAL_REPORT_AGREEMENT | 8,381 | |
|
| HARMFUL_ADVICE | 2,025 | |
|
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT | 7,037 | |
|
| LOAN_AND_SECURITY_AGREEMENT | 9,000 | |
|
| MEDICAL_ADVICE | 2,359 | |
|
| MERGER_AGREEMENT | 7,706 | |
|
| NDA_AGREEMENT | 2,966 | |
|
| NORMAL_TEXT | 6,742 | |
|
| PATENT_APPLICATION_FILLINGS_AGREEMENT | 9,000 | |
|
| PRICE_LIST_AGREEMENT | 9,000 | |
|
| SETTLEMENT_AGREEMENT | 9,000 | |
|
| SEXUAL_HARRASSMENT | 8,321 | |
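
As a rough illustration of that bucketing, the hypothetical sketch below draws a target length from the three buckets with up to ±20 words of jitter; it reconstructs the described scheme for illustration only and is not the actual data pipeline.

```python
# Hypothetical reconstruction of the length bucketing described above:
# target lengths of 128, 256, or 512 words, each jittered by up to 20 words.
import random

def sample_target_length(rng: random.Random) -> int:
    return rng.choice([128, 256, 512]) + rng.randint(-20, 20)

def truncate_to_words(text: str, n_words: int) -> str:
    return " ".join(text.split()[:n_words])

rng = random.Random(0)
document = "lorem ipsum dolor sit amet " * 200
sample = truncate_to_words(document, sample_target_length(rng))
print(len(sample.split()))  # a length near one of the three buckets
```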
|
|
|
|
|
|
|
## Evaluation |
|
|
|
### Testing Data & Metrics |
|
|
|
#### Testing Data |
|
Evaluation was performed on a dataset of 82,917 entries, generated with a sampling temperature between 1 and 1.25 to introduce randomness.
|
Here are the labels along with their respective counts in the dataset: |
|
|
|
| Agreement Type | Instances | |
|
| --------------------------------------- | --------- | |
|
| BOARD_MEETING_AGREEMENT | 4,335 | |
|
| CONSULTING_AGREEMENT | 1,533 | |
|
| CUSTOMER_LIST_AGREEMENT | 4,995 | |
|
| DISTRIBUTION_PARTNER_AGREEMENT | 7,231 | |
|
| EMPLOYEE_AGREEMENT | 1,433 | |
|
| ENTERPRISE_AGREEMENT | 1,616 | |
|
| ENTERPRISE_LICENSE_AGREEMENT | 8,574 | |
|
| EXECUTIVE_SEVERANCE_AGREEMENT | 5,177 | |
|
| FINANCIAL_REPORT_AGREEMENT | 4,264 | |
|
| HARMFUL_ADVICE | 474 | |
|
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT | 4,116 | |
|
| LOAN_AND_SECURITY_AGREEMENT | 6,354 | |
|
| MEDICAL_ADVICE | 289 | |
|
| MERGER_AGREEMENT | 7,079 | |
|
| NDA_AGREEMENT | 1,452 | |
|
| NORMAL_TEXT | 1,808 | |
|
| PATENT_APPLICATION_FILLINGS_AGREEMENT | 6,177 | |
|
| PRICE_LIST_AGREEMENT | 5,453 | |
|
| SETTLEMENT_AGREEMENT | 5,806 | |
|
| SEXUAL_HARRASSMENT | 4,750 | |
|
|
|
|
|
|
|
#### Metrics |
|
|
|
| Agreement Type | precision | recall | f1-score | support | |
|
| ------------------------------------------- | --------- | ------ | -------- | ------- | |
|
| BOARD_MEETING_AGREEMENT | 0.93 | 0.95 | 0.94 | 4335 | |
|
| CONSULTING_AGREEMENT | 0.72 | 0.98 | 0.84 | 1593 | |
|
| CUSTOMER_LIST_AGREEMENT | 0.64 | 0.82 | 0.72 | 4335 | |
|
| DISTRIBUTION_PARTNER_AGREEMENT | 0.83 | 0.47 | 0.61 | 7231 | |
|
| EMPLOYEE_AGREEMENT | 0.78 | 0.92 | 0.85 | 1333 | |
|
| ENTERPRISE_AGREEMENT | 0.29 | 0.40 | 0.34 | 1616 | |
|
| ENTERPRISE_LICENSE_AGREEMENT | 0.88 | 0.79 | 0.83 | 5574 | |
|
| EXECUTIVE_SEVERANCE_AGREEMENT | 0.92 | 0.85 | 0.89 | 8177 |
|
| FINANCIAL_REPORT_AGREEMENT | 0.89 | 0.98 | 0.93 | 4264 | |
|
| HARMFUL_ADVICE | 0.79 | 0.95 | 0.86 | 474 | |
|
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT | 0.91 | 0.98 | 0.94 | 4116 | |
|
| LOAN_AND_SECURITY_AGREEMENT | 0.77 | 0.98 | 0.86 | 6354 | |
|
| MEDICAL_ADVICE | 0.81 | 0.99 | 0.89 | 289 | |
|
| MERGER_AGREEMENT | 0.89 | 0.77 | 0.83 | 7279 | |
|
| NDA_AGREEMENT | 0.70 | 0.57 | 0.62 | 1452 | |
|
| NORMAL_TEXT | 0.79 | 0.97 | 0.87 | 1888 | |
|
| PATENT_APPLICATION_FILLINGS_AGREEMENT | 0.95 | 0.99 | 0.97 | 6177 | |
|
| PRICE_LIST_AGREEMENT | 0.60 | 0.75 | 0.67 | 5565 | |
|
| SETTLEMENT_AGREEMENT | 0.82 | 0.54 | 0.65 | 5843 | |
|
| SEXUAL_HARRASSMENT | 0.97 | 0.94 | 0.95 | 440 |
|
| | | | | | |
|
| accuracy | | | 0.79 | 82916 | |
|
| macro avg | 0.79 | 0.83 | 0.80 | 82916 | |
|
| weighted avg | 0.83 | 0.81 | 0.81 | 82916 | |
|
|
|
|
|
#### Results |
|
|
|
The model's performance is summarized by per-label precision, recall, and f1-score across all 20 labels. Accuracy on the full test set is 0.79, with macro-averaged and weighted-averaged f1-scores of 0.80 and 0.81, respectively.
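
A per-label table in this format can be produced with scikit-learn's classification_report; the sketch below shows how such numbers are computed, with small placeholder arrays standing in for the real evaluation labels and predictions.

```python
# Sketch: computing a precision/recall/f1/support table with scikit-learn.
# y_true and y_pred are placeholders, not the actual evaluation outputs.
from sklearn.metrics import classification_report

y_true = ["NDA_AGREEMENT", "MERGER_AGREEMENT", "NORMAL_TEXT", "NDA_AGREEMENT"]
y_pred = ["NDA_AGREEMENT", "MERGER_AGREEMENT", "NDA_AGREEMENT", "NDA_AGREEMENT"]

print(classification_report(y_true, y_pred, zero_division=0))
```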
|
|
|
|