---
license: mit
language:
- en
---

# Model Card for Pebblo Classifier

This model card describes the Pebblo Classifier, a text classification model developed by DAXA.AI. It assigns one of 20 labels to a piece of text; most of the labels are types of agreement documents found within organizations.

## Model Details

### Model Description

The Pebblo Classifier is a BERT-based model, fine-tuned from distilbert-base-uncased, targeting RAG (Retrieval-Augmented Generation) applications. It classifies text into categories such as "BOARD_MEETING_AGREEMENT" and "CONSULTING_AGREEMENT", streamlining document classification.

- **Developed by:** DAXA.AI
- **Funded by:** Open Source
- **Model type:** Classification model
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** distilbert-base-uncased

### Model Sources

- **Repository:** [https://huggingface.co/daxa-ai/pebblo-classifier](https://huggingface.co/daxa-ai/pebblo-classifier)
- **Demo:** [https://huggingface.co/spaces/daxa-ai/Daxa-Classifier](https://huggingface.co/spaces/daxa-ai/Daxa-Classifier)

## Uses

### Intended Use

The model is designed for direct application in document classification, capable of immediate deployment without additional fine-tuning.

### Recommendations

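End-users should be aware of the model's potential biases and limitations. Per-label performance varies widely: on the test set, the f1-score ranges from 0.34 for ENTERPRISE_AGREEMENT to 0.97 for PATENT_APPLICATION_FILLINGS_AGREEMENT, so predictions for the weaker labels warrant review.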

## How to Get Started with the Model

Use the code below to get started with the model.

```python
# Import necessary libraries
import joblib
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hugging Face model repository
REPO_NAME = "daxa-ai/pebblo-classifier"

# Name of the label encoder file in the repository
LABEL_ENCODER_FILE = "label encoder.joblib"

# Load the tokenizer and model, and switch the model to inference mode
tokenizer = AutoTokenizer.from_pretrained(REPO_NAME)
model = AutoModelForSequenceClassification.from_pretrained(REPO_NAME)
model.eval()

# Example text
text = "Please enter your text here."

# Tokenize, truncating inputs that exceed the model's maximum sequence length
encoded_input = tokenizer(text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    output = model(**encoded_input)

# Apply softmax to the logits
probabilities = torch.nn.functional.softmax(output.logits, dim=-1)

# Get the index of the predicted label
predicted_label = torch.argmax(probabilities, dim=-1)

# Download and cache the label encoder file
filename = hf_hub_download(repo_id=REPO_NAME, filename=LABEL_ENCODER_FILE)

# Load the label encoder
label_encoder = joblib.load(filename)

# Decode the predicted label index into its label name
decoded_label = label_encoder.inverse_transform(predicted_label.numpy())

print(decoded_label)
```
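
The post-processing steps above (softmax, then picking the most probable label) can be tried without downloading the model. The following is a minimal sketch in plain Python; the logits, the three-label subset, and the helper names `softmax` and `top_k` are illustrative assumptions, not output or API of the Pebblo Classifier.

```python
import math

# Hypothetical logits for one input over a three-label subset (illustrative values only).
labels = ["CONSULTING_AGREEMENT", "NDA_AGREEMENT", "NORMAL_TEXT"]
logits = [2.0, 0.5, -1.0]


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probabilities, names, k=2):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(zip(names, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


probabilities = softmax(logits)
predictions = top_k(probabilities, labels, k=2)
for name, prob in predictions:
    print(f"{name}: {prob:.3f}")
```

Looking at the runner-up label and its probability, rather than only the argmax, can help flag inputs on which the classifier is uncertain.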

## Training Details

### Training Data

The training dataset consists of 131,771 entries across 20 unique labels. The labels span various document types, and instances are distributed across three text sizes (128 ± x, 256 ± x, and 512 ± x words, where x is at most 20).
The labels and their instance counts in the dataset are:

| Agreement Type                          | Instances |
| --------------------------------------- | --------- |
| BOARD_MEETING_AGREEMENT                 | 4,225     |
| CONSULTING_AGREEMENT                    | 2,965     |
| CUSTOMER_LIST_AGREEMENT                 | 9,000     |
| DISTRIBUTION_PARTNER_AGREEMENT          | 8,339     |
| EMPLOYEE_AGREEMENT                      | 3,921     |
| ENTERPRISE_AGREEMENT                    | 3,820     |
| ENTERPRISE_LICENSE_AGREEMENT            | 9,000     |
| EXECUTIVE_SEVERANCE_AGREEMENT           | 9,000     |
| FINANCIAL_REPORT_AGREEMENT              | 8,381     |
| HARMFUL_ADVICE                          | 2,025     |
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT      | 7,037     |
| LOAN_AND_SECURITY_AGREEMENT             | 9,000     |
| MEDICAL_ADVICE                          | 2,359     |
| MERGER_AGREEMENT                        | 7,706     |
| NDA_AGREEMENT                           | 2,966     |
| NORMAL_TEXT                             | 6,742     |
| PATENT_APPLICATION_FILLINGS_AGREEMENT   | 9,000     |
| PRICE_LIST_AGREEMENT                    | 9,000     |
| SETTLEMENT_AGREEMENT                    | 9,000     |
| SEXUAL_HARRASSMENT                      | 8,321     |
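
One way to read the three text sizes described above is as word-count windows around 128, 256, and 512 words with a tolerance of 20. The sketch below checks which window a text falls into. It assumes whitespace-separated words and an inclusive tolerance; the helper `size_bucket` and its constants are hypothetical and not part of the training pipeline.

```python
# Assumed nominal sizes and tolerance, read from "128 ± x, 256 ± x, 512 ± x" with x within 20.
NOMINAL_SIZES = (128, 256, 512)
TOLERANCE = 20


def size_bucket(text, sizes=NOMINAL_SIZES, tolerance=TOLERANCE):
    """Return the nominal size whose window contains the text's word count, else None."""
    word_count = len(text.split())  # assumption: words are whitespace-separated
    for size in sizes:
        if abs(word_count - size) <= tolerance:
            return size
    return None


sample = " ".join(["word"] * 140)
print(size_bucket(sample))  # 140 words falls inside the 128 ± 20 window
```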



## Evaluation

### Testing Data & Metrics

#### Testing Data
Evaluation was performed on a dataset of 82,916 entries, with a temperature range of 1 to 1.25 for randomness.
The labels and their instance counts in the dataset are:

| Agreement Type                          | Instances |
| --------------------------------------- | --------- |
| BOARD_MEETING_AGREEMENT                 | 4,335     |
| CONSULTING_AGREEMENT                    | 1,533     |
| CUSTOMER_LIST_AGREEMENT                 | 4,995     |
| DISTRIBUTION_PARTNER_AGREEMENT          | 7,231     |
| EMPLOYEE_AGREEMENT                      | 1,433     |
| ENTERPRISE_AGREEMENT                    | 1,616     |
| ENTERPRISE_LICENSE_AGREEMENT            | 8,574     |
| EXECUTIVE_SEVERANCE_AGREEMENT           | 5,177     |
| FINANCIAL_REPORT_AGREEMENT              | 4,264     |
| HARMFUL_ADVICE                          | 474       |
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT      | 4,116     |
| LOAN_AND_SECURITY_AGREEMENT             | 6,354     |
| MEDICAL_ADVICE                          | 289       |
| MERGER_AGREEMENT                        | 7,079     |
| NDA_AGREEMENT                           | 1,452     |
| NORMAL_TEXT                             | 1,808     |
| PATENT_APPLICATION_FILLINGS_AGREEMENT   | 6,177     |
| PRICE_LIST_AGREEMENT                    | 5,453     |
| SETTLEMENT_AGREEMENT                    | 5,806     |
| SEXUAL_HARRASSMENT                      | 4,750     |



#### Metrics

| Agreement Type                              | precision | recall | f1-score | support |
| ------------------------------------------- | --------- | ------ | -------- | ------- |
| BOARD_MEETING_AGREEMENT                     | 0.93      | 0.95   | 0.94     | 4335    |
| CONSULTING_AGREEMENT                        | 0.72      | 0.98   | 0.84     | 1533    |
| CUSTOMER_LIST_AGREEMENT                     | 0.64      | 0.82   | 0.72     | 4995    |
| DISTRIBUTION_PARTNER_AGREEMENT              | 0.83      | 0.47   | 0.61     | 7231    |
| EMPLOYEE_AGREEMENT                          | 0.78      | 0.92   | 0.85     | 1433    |
| ENTERPRISE_AGREEMENT                        | 0.29      | 0.40   | 0.34     | 1616    |
| ENTERPRISE_LICENSE_AGREEMENT                | 0.88      | 0.79   | 0.83     | 8574    |
| EXECUTIVE_SEVERANCE_AGREEMENT               | 0.92      | 0.85   | 0.89     | 5177    |
| FINANCIAL_REPORT_AGREEMENT                  | 0.89      | 0.98   | 0.93     | 4264    |
| HARMFUL_ADVICE                              | 0.79      | 0.95   | 0.86     | 474     |
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT          | 0.91      | 0.98   | 0.94     | 4116    |
| LOAN_AND_SECURITY_AGREEMENT                 | 0.77      | 0.98   | 0.86     | 6354    |
| MEDICAL_ADVICE                              | 0.81      | 0.99   | 0.89     | 289     |
| MERGER_AGREEMENT                            | 0.89      | 0.77   | 0.83     | 7079    |
| NDA_AGREEMENT                               | 0.70      | 0.57   | 0.62     | 1452    |
| NORMAL_TEXT                                 | 0.79      | 0.97   | 0.87     | 1808    |
| PATENT_APPLICATION_FILLINGS_AGREEMENT       | 0.95      | 0.99   | 0.97     | 6177    |
| PRICE_LIST_AGREEMENT                        | 0.60      | 0.75   | 0.67     | 5453    |
| SETTLEMENT_AGREEMENT                        | 0.82      | 0.54   | 0.65     | 5806    |
| SEXUAL_HARRASSMENT                          | 0.97      | 0.94   | 0.95     | 4750    |
|                                             |           |        |          |         |
| accuracy                                    |           |        | 0.79     | 82916   |
| macro avg                                   | 0.79      | 0.83   | 0.80     | 82916   |
| weighted avg                                | 0.83      | 0.81   | 0.81     | 82916   |
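
The f1-score column is the harmonic mean of precision and recall. As a minimal sketch, the snippet below recomputes it for three rows from the table's rounded precision and recall values. The helper name `f1_score` is an illustrative choice. Because the inputs are already rounded to two decimals, a recomputed value can differ from the reported one by 0.01 for some labels.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded (precision, recall) pairs taken from the metrics table.
rows = {
    "ENTERPRISE_AGREEMENT": (0.29, 0.40),
    "CUSTOMER_LIST_AGREEMENT": (0.64, 0.82),
    "PRICE_LIST_AGREEMENT": (0.60, 0.75),
}

for label, (precision, recall) in rows.items():
    print(f"{label}: {f1_score(precision, recall):.2f}")
```

The harmonic mean is pulled toward the lower of the two values, which is why a label with high precision but low recall (or the reverse) still ends up with a modest f1-score.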


#### Results

Precision, recall, and f1-score are reported for each of the 20 labels in the dataset. Accuracy on the entire test set is 0.79. The macro averages of precision, recall, and f1-score are 0.79, 0.83, and 0.80, and the weighted averages are 0.83, 0.81, and 0.81, respectively.
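
The difference between the two kinds of average is that a macro average weights every label equally, while a weighted average weights each label by its number of test instances. The sketch below recomputes both from the rounded per-label metrics, using the labels and instance counts listed in the Testing Data table as weights. The helper names are illustrative, and small deviations from the reported figures are possible because the inputs are rounded.

```python
# (label, precision, recall, f1, support); support counts come from the Testing Data table.
ROWS = [
    ("BOARD_MEETING_AGREEMENT", 0.93, 0.95, 0.94, 4335),
    ("CONSULTING_AGREEMENT", 0.72, 0.98, 0.84, 1533),
    ("CUSTOMER_LIST_AGREEMENT", 0.64, 0.82, 0.72, 4995),
    ("DISTRIBUTION_PARTNER_AGREEMENT", 0.83, 0.47, 0.61, 7231),
    ("EMPLOYEE_AGREEMENT", 0.78, 0.92, 0.85, 1433),
    ("ENTERPRISE_AGREEMENT", 0.29, 0.40, 0.34, 1616),
    ("ENTERPRISE_LICENSE_AGREEMENT", 0.88, 0.79, 0.83, 8574),
    ("EXECUTIVE_SEVERANCE_AGREEMENT", 0.92, 0.85, 0.89, 5177),
    ("FINANCIAL_REPORT_AGREEMENT", 0.89, 0.98, 0.93, 4264),
    ("HARMFUL_ADVICE", 0.79, 0.95, 0.86, 474),
    ("INTERNAL_PRODUCT_ROADMAP_AGREEMENT", 0.91, 0.98, 0.94, 4116),
    ("LOAN_AND_SECURITY_AGREEMENT", 0.77, 0.98, 0.86, 6354),
    ("MEDICAL_ADVICE", 0.81, 0.99, 0.89, 289),
    ("MERGER_AGREEMENT", 0.89, 0.77, 0.83, 7079),
    ("NDA_AGREEMENT", 0.70, 0.57, 0.62, 1452),
    ("NORMAL_TEXT", 0.79, 0.97, 0.87, 1808),
    ("PATENT_APPLICATION_FILLINGS_AGREEMENT", 0.95, 0.99, 0.97, 6177),
    ("PRICE_LIST_AGREEMENT", 0.60, 0.75, 0.67, 5453),
    ("SETTLEMENT_AGREEMENT", 0.82, 0.54, 0.65, 5806),
    ("SEXUAL_HARRASSMENT", 0.97, 0.94, 0.95, 4750),
]


def macro_average(values):
    """Unweighted mean: every label counts equally."""
    return sum(values) / len(values)


def weighted_average(values, weights):
    """Mean weighted by support: frequent labels count more."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


supports = [row[4] for row in ROWS]
for name, col in (("precision", 1), ("recall", 2), ("f1-score", 3)):
    column = [row[col] for row in ROWS]
    macro = macro_average(column)
    weighted = weighted_average(column, supports)
    print(f"{name}: macro={macro:.2f} weighted={weighted:.2f}")
```

Macro recall exceeds weighted recall here because several small classes, such as MEDICAL_ADVICE and HARMFUL_ADVICE, have high recall, while some large classes, such as DISTRIBUTION_PARTNER_AGREEMENT and SETTLEMENT_AGREEMENT, have low recall.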