Update README with model card details (#5)

- Updated README.md with Model Card (ebbece91a1dee4a2fbf8b5e4d1f3a9eee02c44ac)

Co-authored-by: Harshit <Tihsrah-CD@users.noreply.huggingface.co>

README.md +182 -0

@@ -1,3 +1,185 @@
 ---
 license: mit
+language:
+- en
 ---
+# Model Card for Pebblo Classifier
+This model card describes the Pebblo Classifier, a text classification model developed by DAXA.AI. It categorizes organizational documents, chiefly agreement types, into 20 distinct labels.
+## Model Details
+### Model Description
+The Pebblo Classifier is a DistilBERT-based model, fine-tuned from distilbert-base-uncased, targeting RAG (Retrieval-Augmented Generation) applications. It classifies text into categories such as "BOARD_MEETING_AGREEMENT" and "CONSULTING_AGREEMENT," streamlining document classification.
+- **Developed by:** DAXA.AI
+- **Funded by:** Open Source
+- **Model type:** Classification model
+- **Language(s) (NLP):** English
+- **License:** MIT
+- **Finetuned from model:** distilbert-base-uncased
+### Model Sources
+- **Repository:** [https://huggingface.co/daxa-ai/pebblo-classifier](https://huggingface.co/daxa-ai/pebblo-classifier)
+- **Demo:** [https://huggingface.co/spaces/daxa-ai/Daxa-Classifier](https://huggingface.co/spaces/daxa-ai/Daxa-Classifier)
+## Uses
+### Intended Use
+The model is designed for direct application in document classification, capable of immediate deployment without additional fine-tuning.
+### Recommendations
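+Users should be aware of the model's biases and limitations before relying on its output. Per-label performance varies widely in the evaluation reported in this card: F1-score ranges from 0.34 for ENTERPRISE_AGREEMENT to 0.97 for PATENT_APPLICATION_FILLINGS_AGREEMENT, so predictions for the weaker labels warrant extra review.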
+## How to Get Started with the Model
+Use the code below to get started with the model.
+```python
+# Import necessary libraries
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+import joblib
+from huggingface_hub import hf_hub_download
+# Hugging Face model repository and the label encoder file stored in it
+REPO_NAME = "daxa-ai/pebblo-classifier"
+LABEL_ENCODER_FILE = "label encoder.joblib"
+# Load the tokenizer and model
+tokenizer = AutoTokenizer.from_pretrained(REPO_NAME)
+model = AutoModelForSequenceClassification.from_pretrained(REPO_NAME)
+# Example text; truncation keeps long inputs within the model's maximum length
+text = "Please enter your text here."
+encoded_input = tokenizer(text, return_tensors="pt", truncation=True)
+# Run the model without tracking gradients (inference only)
+with torch.no_grad():
+    output = model(**encoded_input)
+# Apply softmax to the logits
+probabilities = torch.nn.functional.softmax(output.logits, dim=-1)
+# Get the predicted label index
+predicted_label = torch.argmax(probabilities, dim=-1)
+# Download and cache the label encoder file
+# (hf_hub_download replaces the deprecated hf_hub_url + cached_download pair)
+filename = hf_hub_download(repo_id=REPO_NAME, filename=LABEL_ENCODER_FILE)
+# Load the label encoder
+label_encoder = joblib.load(filename)
+# Decode the predicted label index back into its label name
+decoded_label = label_encoder.inverse_transform(predicted_label.numpy())
+print(decoded_label)
+```
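The snippet above reports only the single most likely label. When several labels are plausible, it can help to look at the top few candidates and their probabilities. The following is a minimal sketch of that idea in plain Python, not part of the published model code. The helper names `softmax` and `top_k_labels` are hypothetical, and the logits are made-up illustrative values. With the real model, the logits could presumably come from `output.logits[0].tolist()` and the label names from the loaded label encoder, assuming it is a scikit-learn `LabelEncoder` exposing `classes_`.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, labels, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical logits for three of the model's labels (illustrative values only)
example_labels = ["NDA_AGREEMENT", "CONSULTING_AGREEMENT", "NORMAL_TEXT"]
example_logits = [2.0, 1.0, 0.1]

for label, prob in top_k_labels(example_logits, example_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

A caller could also compare the top probability against a threshold of its own choosing and route low-confidence inputs to manual review.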
+## Training Details
+### Training Data
+The training dataset consists of 131,771 entries, with 20 unique labels. The labels span various document types, with instances distributed across three text sizes (128 ± x, 256 ± x, and 512 ± x words; x varies within 20).
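Because the training texts cluster around 128, 256, and 512 words, a much longer document may be easier to classify after splitting it into windows of similar size. The sketch below shows one way to do that. The function name `split_into_word_windows` and its parameters are hypothetical, and splitting on whitespace is an assumption rather than the preprocessing used to build the dataset.

```python
def split_into_word_windows(text, window_size=512, overlap=0):
    """Split text into consecutive windows of at most window_size words."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be in [0, window_size)")
    words = text.split()
    step = window_size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break  # the last window already reaches the end of the text
    return windows


# Illustrative input: a synthetic 1,000-word document
document = " ".join(f"word{i}" for i in range(1000))
for index, window in enumerate(split_into_word_windows(document, window_size=512)):
    print(index, len(window.split()))
```

Note that 512 words usually tokenize to more than 512 tokens, which is the maximum sequence length of distilbert-base-uncased, so tokenizer truncation can still apply to the largest windows.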
+Here are the labels along with their respective counts in the dataset:
+| Agreement Type                          | Instances |
+| --------------------------------------- | --------- |
+| BOARD_MEETING_AGREEMENT                 | 4,225     |
+| CONSULTING_AGREEMENT                    | 2,965     |
+| CUSTOMER_LIST_AGREEMENT                 | 9,000     |
+| DISTRIBUTION_PARTNER_AGREEMENT          | 8,339     |
+| EMPLOYEE_AGREEMENT                      | 3,921     |
+| ENTERPRISE_AGREEMENT                    | 3,820     |
+| ENTERPRISE_LICENSE_AGREEMENT            | 9,000     |
+| EXECUTIVE_SEVERANCE_AGREEMENT           | 9,000     |
+| FINANCIAL_REPORT_AGREEMENT              | 8,381     |
+| HARMFUL_ADVICE                          | 2,025     |
+| INTERNAL_PRODUCT_ROADMAP_AGREEMENT      | 7,037     |
+| LOAN_AND_SECURITY_AGREEMENT             | 9,000     |
+| MEDICAL_ADVICE                          | 2,359     |
+| MERGER_AGREEMENT                        | 7,706     |
+| NDA_AGREEMENT                           | 2,966     |
+| NORMAL_TEXT                             | 6,742     |
+| PATENT_APPLICATION_FILLINGS_AGREEMENT   | 9,000     |
+| PRICE_LIST_AGREEMENT                    | 9,000     |
+| SETTLEMENT_AGREEMENT                    | 9,000     |
+| SEXUAL_HARRASSMENT                      | 8,321     |
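The counts above are uneven: the largest classes have 9,000 instances while HARMFUL_ADVICE has 2,025. As a rough illustration of how that imbalance could be quantified, the sketch below computes inverse-frequency weights using the same formula as scikit-learn's "balanced" heuristic, `total / (num_classes * count)`. This card does not say whether class weighting was used during training, so treat it as an analysis aid under that assumption, not a description of the training procedure.

```python
# Training instance counts copied from the table above
train_counts = {
    "BOARD_MEETING_AGREEMENT": 4225,
    "CONSULTING_AGREEMENT": 2965,
    "CUSTOMER_LIST_AGREEMENT": 9000,
    "DISTRIBUTION_PARTNER_AGREEMENT": 8339,
    "EMPLOYEE_AGREEMENT": 3921,
    "ENTERPRISE_AGREEMENT": 3820,
    "ENTERPRISE_LICENSE_AGREEMENT": 9000,
    "EXECUTIVE_SEVERANCE_AGREEMENT": 9000,
    "FINANCIAL_REPORT_AGREEMENT": 8381,
    "HARMFUL_ADVICE": 2025,
    "INTERNAL_PRODUCT_ROADMAP_AGREEMENT": 7037,
    "LOAN_AND_SECURITY_AGREEMENT": 9000,
    "MEDICAL_ADVICE": 2359,
    "MERGER_AGREEMENT": 7706,
    "NDA_AGREEMENT": 2966,
    "NORMAL_TEXT": 6742,
    "PATENT_APPLICATION_FILLINGS_AGREEMENT": 9000,
    "PRICE_LIST_AGREEMENT": 9000,
    "SETTLEMENT_AGREEMENT": 9000,
    "SEXUAL_HARRASSMENT": 8321,
}

total = sum(train_counts.values())
num_classes = len(train_counts)

# Inverse-frequency weights: rarer labels receive larger weights
class_weights = {
    label: total / (num_classes * count) for label, count in train_counts.items()
}

# Ratio between the largest and smallest class
imbalance_ratio = max(train_counts.values()) / min(train_counts.values())

rarest = min(train_counts, key=train_counts.get)
print(f"Rarest label: {rarest} (weight {class_weights[rarest]:.2f})")
print(f"Largest-to-smallest class ratio: {imbalance_ratio:.2f}")
```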
+## Evaluation
+### Testing Data & Metrics
+#### Testing Data
+Evaluation was performed on a dataset of 82,916 entries with a temperature range of 1 to 1.25 for randomness.
+Here are the labels along with their respective counts in the dataset:
+| Agreement Type                          | Instances |
+| --------------------------------------- | --------- |
+| BOARD_MEETING_AGREEMENT                 | 4,335     |
+| CONSULTING_AGREEMENT                    | 1,533     |
+| CUSTOMER_LIST_AGREEMENT                 | 4,995     |
+| DISTRIBUTION_PARTNER_AGREEMENT          | 7,231     |
+| EMPLOYEE_AGREEMENT                      | 1,433     |
+| ENTERPRISE_AGREEMENT                    | 1,616     |
+| ENTERPRISE_LICENSE_AGREEMENT            | 8,574     |
+| EXECUTIVE_SEVERANCE_AGREEMENT           | 5,177     |
+| FINANCIAL_REPORT_AGREEMENT              | 4,264     |
+| HARMFUL_ADVICE                          | 474       |
+| INTERNAL_PRODUCT_ROADMAP_AGREEMENT      | 4,116     |
+| LOAN_AND_SECURITY_AGREEMENT             | 6,354     |
+| MEDICAL_ADVICE                          | 289       |
+| MERGER_AGREEMENT                        | 7,079     |
+| NDA_AGREEMENT                           | 1,452     |
+| NORMAL_TEXT                             | 1,808     |
+| PATENT_APPLICATION_FILLINGS_AGREEMENT   | 6,177     |
+| PRICE_LIST_AGREEMENT                    | 5,453     |
+| SETTLEMENT_AGREEMENT                    | 5,806     |
+| SEXUAL_HARRASSMENT                      | 4,750     |
+#### Metrics
+| Agreement Type                              | precision | recall | f1-score | support |
+| ------------------------------------------- | --------- | ------ | -------- | ------- |
+| BOARD_MEETING_AGREEMENT                     | 0.93      | 0.95   | 0.94     | 4335    |
+| CONSULTING_AGREEMENT                        | 0.72      | 0.98   | 0.84     | 1533    |
+| CUSTOMER_LIST_AGREEMENT                     | 0.64      | 0.82   | 0.72     | 4995    |
+| DISTRIBUTION_PARTNER_AGREEMENT              | 0.83      | 0.47   | 0.61     | 7231    |
+| EMPLOYEE_AGREEMENT                          | 0.78      | 0.92   | 0.85     | 1433    |
+| ENTERPRISE_AGREEMENT                        | 0.29      | 0.40   | 0.34     | 1616    |
+| ENTERPRISE_LICENSE_AGREEMENT                | 0.88      | 0.79   | 0.83     | 8574    |
+| EXECUTIVE_SEVERANCE_AGREEMENT               | 0.92      | 0.85   | 0.89     | 5177    |
+| FINANCIAL_REPORT_AGREEMENT                  | 0.89      | 0.98   | 0.93     | 4264    |
+| HARMFUL_ADVICE                              | 0.79      | 0.95   | 0.86     | 474     |
+| INTERNAL_PRODUCT_ROADMAP_AGREEMENT          | 0.91      | 0.98   | 0.94     | 4116    |
+| LOAN_AND_SECURITY_AGREEMENT                 | 0.77      | 0.98   | 0.86     | 6354    |
+| MEDICAL_ADVICE                              | 0.81      | 0.99   | 0.89     | 289     |
+| MERGER_AGREEMENT                            | 0.89      | 0.77   | 0.83     | 7079    |
+| NDA_AGREEMENT                               | 0.70      | 0.57   | 0.62     | 1452    |
+| NORMAL_TEXT                                 | 0.79      | 0.97   | 0.87     | 1808    |
+| PATENT_APPLICATION_FILLINGS_AGREEMENT       | 0.95      | 0.99   | 0.97     | 6177    |
+| PRICE_LIST_AGREEMENT                        | 0.60      | 0.75   | 0.67     | 5453    |
+| SETTLEMENT_AGREEMENT                        | 0.82      | 0.54   | 0.65     | 5806    |
+| SEXUAL_HARASSMENT                           | 0.97      | 0.94   | 0.95     | 4750    |
+|                                             |           |        |          |         |
+| accuracy                                    |           |        | 0.79     | 82916   |
+| macro avg                                   | 0.79      | 0.83   | 0.80     | 82916   |
+| weighted avg                                | 0.83      | 0.81   | 0.81     | 82916   |
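For readers less familiar with the f1-score column, each value is the harmonic mean of that row's precision and recall. The short sketch below recomputes it for three rows of the table as a worked example. The helper name `f1_score` is hypothetical, and because the inputs are already rounded to two decimals, recomputed values for other rows may differ from the table in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from three rows of the metrics table
rows = {
    "BOARD_MEETING_AGREEMENT": (0.93, 0.95),
    "ENTERPRISE_AGREEMENT": (0.29, 0.40),
    "PATENT_APPLICATION_FILLINGS_AGREEMENT": (0.95, 0.99),
}

for label, (precision, recall) in rows.items():
    print(f"{label}: F1 = {f1_score(precision, recall):.2f}")
```

The harmonic mean penalizes a gap between precision and recall, which is why a label such as DISTRIBUTION_PARTNER_AGREEMENT (0.83 precision, 0.47 recall) ends up with a comparatively low f1-score.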
+#### Results
+The table above reports precision, recall, and f1-score for each of the 20 labels. Overall accuracy on the test set is 0.79. The macro averages are 0.79 (precision), 0.83 (recall), and 0.80 (f1-score); the weighted averages are 0.83, 0.81, and 0.81, respectively.