---
|
license: mit |
|
language: |
|
- en |
|
--- |
|
|
|
# Model Card for Pebblo Classifier
|
|
|
This model card describes the Pebblo Classifier, a text classification model developed by DAXA.AI. It categorizes agreement and other organizational documents, and was trained on 20 distinct labels.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
The Pebblo Classifier is a DistilBERT-based model, fine-tuned from distilbert-base-uncased for RAG (Retrieval-Augmented Generation) applications. It classifies text into categories such as "BOARD_MEETING_AGREEMENT," "CONSULTING_AGREEMENT," and others, streamlining document classification processes.
|
|
|
- **Developed by:** DAXA.AI |
|
- **Funded by:** Open Source |
|
- **Model type:** Classification model |
|
- **Language(s) (NLP):** English |
|
- **License:** MIT |
|
- **Finetuned from model:** distilbert-base-uncased |
|
|
|
### Model Sources |
|
|
|
- **Repository:** [https://huggingface.co/daxa-ai/pebblo-classifier](https://huggingface.co/daxa-ai/pebblo-classifier)
|
- **Demo:** [https://huggingface.co/spaces/daxa-ai/Daxa-Classifier](https://huggingface.co/spaces/daxa-ai/Daxa-Classifier) |
|
|
|
## Uses |
|
|
|
### Intended Use |
|
|
|
The model is designed for direct application in document classification, capable of immediate deployment without additional fine-tuning. |
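
For a quick check, the model can also be called through the transformers pipeline API. The snippet below is a minimal sketch and assumes the repository's config maps class indices to readable label names; if it does not, decode predictions with the label encoder shown in "How to Get Started with the Model" below.

```python
# Minimal sketch using the high-level pipeline API. Assumption: the repo's
# config.json carries a readable id2label mapping; otherwise decode with the
# label encoder shown later in this card.
from transformers import pipeline

classifier = pipeline("text-classification", model="daxa-ai/pebblo-classifier")
result = classifier("This mutual non-disclosure agreement is entered into by the parties...")
print(result)  # e.g. [{'label': 'NDA_AGREEMENT', 'score': 0.97}] (illustrative)
```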
|
|
|
### Recommendations |
|
|
|
Users should be aware of the model's potential biases and limitations, such as its fixed 20-label taxonomy and English-only training data, before relying on its predictions.
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
```python
# Import necessary libraries
import joblib
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("daxa-ai/pebblo-classifier")
model = AutoModelForSequenceClassification.from_pretrained("daxa-ai/pebblo-classifier")

# Example text
text = "Please enter your text here."
encoded_input = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded_input)

# Apply softmax to the logits to obtain class probabilities
probabilities = torch.nn.functional.softmax(output.logits, dim=-1)

# Get the index of the predicted label
probabilities = torch.nn.functional.softmax(output.logits, dim=-1)
predicted_label = torch.argmax(probabilities, dim=-1)

# Hugging Face model repository and the label encoder file it hosts
REPO_NAME = "daxa-ai/pebblo-classifier"
LABEL_ENCODER_FILE = "label encoder.joblib"

# Download and cache the label encoder file from the repository
filename = hf_hub_download(repo_id=REPO_NAME, filename=LABEL_ENCODER_FILE)

# Load the label encoder and decode the predicted index into a label string
label_encoder = joblib.load(filename)
decoded_label = label_encoder.inverse_transform(predicted_label.numpy())

print(decoded_label)
```
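
Continuing from the snippet above, the same tokenizer, model, and label encoder can score a batch of documents in one pass; the padding, truncation, and max-length settings below are illustrative assumptions rather than values prescribed by the model card.

```python
# Batched inference, reusing tokenizer, model, and label_encoder from the
# snippet above. padding/truncation/max_length are illustrative choices.
texts = ["First agreement text...", "Second agreement text..."]
batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
labels = label_encoder.inverse_transform(logits.argmax(dim=-1).numpy())
print(labels)
```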
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The training dataset consists of 131,771 entries spanning 20 unique labels and various document types. Instances are distributed across three target text lengths (128, 256, and 512 words, each varying by up to ±20 words); a hypothetical sketch of this bucketing follows the table below.
|
Here are the labels along with their respective counts in the dataset: |
|
|
|
| Agreement Type | Instances | |
|
| --------------------------------------- | --------- | |
|
| BOARD_MEETING_AGREEMENT | 4,225 | |
|
| CONSULTING_AGREEMENT | 2,965 | |
|
| CUSTOMER_LIST_AGREEMENT | 9,000 | |
|
| DISTRIBUTION_PARTNER_AGREEMENT | 8,339 | |
|
| EMPLOYEE_AGREEMENT | 3,921 | |
|
| ENTERPRISE_AGREEMENT | 3,820 | |
|
| ENTERPRISE_LICENSE_AGREEMENT | 9,000 | |
|
| EXECUTIVE_SEVERANCE_AGREEMENT | 9,000 | |
|
| FINANCIAL_REPORT_AGREEMENT | 8,381 | |
|
| HARMFUL_ADVICE | 2,025 | |
|
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT | 7,037 | |
|
| LOAN_AND_SECURITY_AGREEMENT | 9,000 | |
|
| MEDICAL_ADVICE | 2,359 | |
|
| MERGER_AGREEMENT | 7,706 | |
|
| NDA_AGREEMENT | 2,966 | |
|
| NORMAL_TEXT | 6,742 | |
|
| PATENT_APPLICATION_FILLINGS_AGREEMENT | 9,000 | |
|
| PRICE_LIST_AGREEMENT | 9,000 | |
|
| SETTLEMENT_AGREEMENT | 9,000 | |
|
| SEXUAL_HARRASSMENT | 8,321 | |
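
As a rough illustration of that bucketing, the hypothetical sketch below draws a target length from the three buckets with up to ±20 words of jitter; it reconstructs the described scheme for illustration only and is not the actual data pipeline.

```python
# Hypothetical reconstruction of the length bucketing described above:
# target lengths of 128, 256, or 512 words, each jittered by up to 20 words.
import random

def sample_target_length(rng: random.Random) -> int:
    return rng.choice([128, 256, 512]) + rng.randint(-20, 20)

def truncate_to_words(text: str, n_words: int) -> str:
    return " ".join(text.split()[:n_words])

rng = random.Random(0)
document = "lorem ipsum dolor sit amet " * 200
sample = truncate_to_words(document, sample_target_length(rng))
print(len(sample.split()))  # a length near one of the three buckets
```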
|
|
|
|
|
|
|
## Evaluation |
|
|
|
### Testing Data & Metrics |
|
|
|
#### Testing Data |
|
Evaluation was performed on a dataset of 82,917 entries, generated with a sampling temperature between 1 and 1.25 to introduce randomness.
|
Here are the labels along with their respective counts in the dataset: |
|
|
|
| Agreement Type | Instances | |
|
| --------------------------------------- | --------- | |
|
| BOARD_MEETING_AGREEMENT | 4,335 | |
|
| CONSULTING_AGREEMENT | 1,533 | |
|
| CUSTOMER_LIST_AGREEMENT | 4,995 | |
|
| DISTRIBUTION_PARTNER_AGREEMENT | 7,231 | |
|
| EMPLOYEE_AGREEMENT | 1,433 | |
|
| ENTERPRISE_AGREEMENT | 1,616 | |
|
| ENTERPRISE_LICENSE_AGREEMENT | 8,574 | |
|
| EXECUTIVE_SEVERANCE_AGREEMENT | 5,177 | |
|
| FINANCIAL_REPORT_AGREEMENT | 4,264 | |
|
| HARMFUL_ADVICE | 474 | |
|
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT | 4,116 | |
|
| LOAN_AND_SECURITY_AGREEMENT | 6,354 | |
|
| MEDICAL_ADVICE | 289 | |
|
| MERGER_AGREEMENT | 7,079 | |
|
| NDA_AGREEMENT | 1,452 | |
|
| NORMAL_TEXT | 1,808 | |
|
| PATENT_APPLICATION_FILLINGS_AGREEMENT | 6,177 | |
|
| PRICE_LIST_AGREEMENT | 5,453 | |
|
| SETTLEMENT_AGREEMENT | 5,806 | |
|
| SEXUAL_HARRASSMENT | 4,750 | |
|
|
|
|
|
|
|
#### Metrics |
|
|
|
| Agreement Type | precision | recall | f1-score | support | |
|
| ------------------------------------------- | --------- | ------ | -------- | ------- | |
|
| BOARD_MEETING_AGREEMENT | 0.93 | 0.95 | 0.94 | 4335 | |
|
| CONSULTING_AGREEMENT | 0.72 | 0.98 | 0.84 | 1593 | |
|
| CUSTOMER_LIST_AGREEMENT | 0.64 | 0.82 | 0.72 | 4335 | |
|
| DISTRIBUTION_PARTNER_AGREEMENT | 0.83 | 0.47 | 0.61 | 7231 | |
|
| EMPLOYEE_AGREEMENT | 0.78 | 0.92 | 0.85 | 1333 | |
|
| ENTERPRISE_AGREEMENT | 0.29 | 0.40 | 0.34 | 1616 | |
|
| ENTERPRISE_LICENSE_AGREEMENT | 0.88 | 0.79 | 0.83 | 5574 | |
|
| EXECUTIVE_SEVERANCE_AGREEMENT | 0.92 | 0.85 | 0.89 | 8177 |
|
| FINANCIAL_REPORT_AGREEMENT | 0.89 | 0.98 | 0.93 | 4264 | |
|
| HARMFUL_ADVICE | 0.79 | 0.95 | 0.86 | 474 | |
|
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT | 0.91 | 0.98 | 0.94 | 4116 | |
|
| LOAN_AND_SECURITY_AGREEMENT | 0.77 | 0.98 | 0.86 | 6354 | |
|
| MEDICAL_ADVICE | 0.81 | 0.99 | 0.89 | 289 | |
|
| MERGER_AGREEMENT | 0.89 | 0.77 | 0.83 | 7279 | |
|
| NDA_AGREEMENT | 0.70 | 0.57 | 0.62 | 1452 | |
|
| NORMAL_TEXT | 0.79 | 0.97 | 0.87 | 1888 | |
|
| PATENT_APPLICATION_FILLINGS_AGREEMENT | 0.95 | 0.99 | 0.97 | 6177 | |
|
| PRICE_LIST_AGREEMENT | 0.60 | 0.75 | 0.67 | 5565 | |
|
| SETTLEMENT_AGREEMENT | 0.82 | 0.54 | 0.65 | 5843 | |
|
| SEXUAL_HARRASSMENT | 0.97 | 0.94 | 0.95 | 440 |
|
| | | | | | |
|
| accuracy | | | 0.79 | 82916 | |
|
| macro avg | 0.79 | 0.83 | 0.80 | 82916 | |
|
| weighted avg | 0.83 | 0.81 | 0.81 | 82916 | |
|
|
|
|
|
#### Results |
|
|
|
The model's performance is summarized by per-label precision, recall, and f1-score across all 20 labels. Accuracy on the full test set is 0.79, with macro-averaged and weighted-averaged f1-scores of 0.80 and 0.81, respectively.
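
A per-label table in this format can be produced with scikit-learn's classification_report; the sketch below shows how such numbers are computed, with small placeholder arrays standing in for the real evaluation labels and predictions.

```python
# Sketch: computing a precision/recall/f1/support table with scikit-learn.
# y_true and y_pred are placeholders, not the actual evaluation outputs.
from sklearn.metrics import classification_report

y_true = ["NDA_AGREEMENT", "MERGER_AGREEMENT", "NORMAL_TEXT", "NDA_AGREEMENT"]
y_pred = ["NDA_AGREEMENT", "MERGER_AGREEMENT", "NDA_AGREEMENT", "NDA_AGREEMENT"]

print(classification_report(y_true, y_pred, zero_division=0))
```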
|
|
|
|