Kaleemullah committed
Commit 8651128
Parent(s): 2b51a3c

Add SetFit model

Files changed:
- README.md +23 -43
- config.json +1 -1
- model.safetensors +1 -1
- model_head.pkl +1 -1
- special_tokens_map.json +2 -2
- tokenizer_config.json +7 -0
README.md
CHANGED
@@ -5,65 +5,45 @@ tags:
 - sentence-transformers
 - text-classification
 pipeline_tag: text-classification
-language:
-- en
-metrics:
-- accuracy
 ---

 # Kaleemullah/paraphrase-mpnet-base-v2-ads-nonads-classifier

-The "paraphrase-mpnet-base-v2-ads-nonads-classifier" is a SetFit model tailored for efficient text classification, specifically designed to differentiate between ad-related and non-ad content. This model leverages advanced few-shot learning techniques, making it a powerful tool for applications in content analysis and digital marketing.
-
-The model was trained on a comprehensive dataset consisting of various examples of ad-related and non-ad content sourced from multiple digital marketing platforms. The dataset includes a diverse range of linguistic styles and terminologies, ensuring robustness and versatility in the model's classification ability.
-
-## Performance Metrics
-The model performs excellently in text classification tasks, highlighted by its accuracy, precision, recall, and F1 score metrics. These metrics underscore the model's effectiveness in correctly identifying and classifying ad and non-ad content.
-
-## Installation and Usage
-
-To install the SetFit library, use one of the following commands:

 ```bash
-# or
-pip install setfit
 ```

-You can then run the inference as follows:

 ```python
 from setfit import SetFitModel

-#
 model = SetFitModel.from_pretrained("Kaleemullah/paraphrase-mpnet-base-v2-ads-nonads-classifier")
-
-# Example texts for classification
-texts = ["xiaomi phone best with best camera quality ", "Pineapple on pizza is the worst 🤮"]
-
 # Run inference
-predictions = model(texts)
-
-# 'predictions' will be a list of labels corresponding to each input text
-for text, pred in zip(texts, predictions):
-    print(f"Text: {text}\nPredicted Category: {'Ad' if pred else 'Non-Ad'}\n")
 ```
 - sentence-transformers
 - text-classification
 pipeline_tag: text-classification
 ---

 # Kaleemullah/paraphrase-mpnet-base-v2-ads-nonads-classifier

+This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for text classification. The model has been trained using an efficient few-shot learning technique that involves:
+
+1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
+2. Training a classification head with features from the fine-tuned Sentence Transformer.
+
+## Usage
+
+To use this model for inference, first install the SetFit library:

 ```bash
+python -m pip install setfit
 ```

+You can then run inference as follows:

 ```python
 from setfit import SetFitModel

+# Download from Hub and run inference
 model = SetFitModel.from_pretrained("Kaleemullah/paraphrase-mpnet-base-v2-ads-nonads-classifier")
 # Run inference
+preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
 ```

+## BibTeX entry and citation info
+
+```bibtex
+@article{https://doi.org/10.48550/arxiv.2209.11055,
+  doi = {10.48550/ARXIV.2209.11055},
+  url = {https://arxiv.org/abs/2209.11055},
+  author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
+  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
+  title = {Efficient Few-Shot Learning Without Prompts},
+  publisher = {arXiv},
+  year = {2022},
+  copyright = {Creative Commons Attribution 4.0 International}
+}
+```
config.json
CHANGED
@@ -1,5 +1,5 @@
 {
-  "_name_or_path": "/root/.cache/torch/sentence_transformers/
+  "_name_or_path": "/root/.cache/torch/sentence_transformers/Kaleemullah_paraphrase-mpnet-base-v2-ads-nonads-classifier/",
   "architectures": [
     "MPNetModel"
   ],
model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:0faec4a3c5dd1650ceeb4068d6b6c862833cd0bc26904c900ca9bc057e7d5fbb
 size 437967672
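The model.safetensors and model_head.pkl entries above are Git LFS pointer files: three "key value" lines giving the spec version, the sha256 object id, and the byte size of the stored blob. A minimal sketch that parses such a pointer (using the pointer text from this diff):

```python
# A Git LFS pointer file is plain text: one "key value" pair per line.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:0faec4a3c5dd1650ceeb4068d6b6c862833cd0bc26904c900ca9bc057e7d5fbb
size 437967672
"""

def parse_lfs_pointer(text):
    # Split each line on the first space into key and value.
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

ptr = parse_lfs_pointer(pointer_text)
print(ptr["oid"])        # sha256:0faec4a3...
print(int(ptr["size"]))  # 437967672
```

The 437967672-byte size (~438 MB) is the actual weight file that LFS fetches; only this small pointer lives in the git history.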
model_head.pkl
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:ce7c32d373f1de1eb51580dadc989970716a38e1e7c975fb683af0f456aaafd0
 size 7706
special_tokens_map.json
CHANGED
@@ -9,7 +9,7 @@
   "cls_token": {
     "content": "<s>",
     "lstrip": false,
-    "normalized":
+    "normalized": false,
     "rstrip": false,
     "single_word": false
   },
@@ -37,7 +37,7 @@
   "sep_token": {
     "content": "</s>",
     "lstrip": false,
-    "normalized":
+    "normalized": false,
     "rstrip": false,
     "single_word": false
   },
tokenizer_config.json
CHANGED
@@ -48,12 +48,19 @@
   "do_lower_case": true,
   "eos_token": "</s>",
   "mask_token": "<mask>",
+  "max_length": 512,
   "model_max_length": 512,
   "never_split": null,
+  "pad_to_multiple_of": null,
   "pad_token": "<pad>",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
   "sep_token": "</s>",
+  "stride": 0,
   "strip_accents": null,
   "tokenize_chinese_chars": true,
   "tokenizer_class": "MPNetTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
   "unk_token": "[UNK]"
 }
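The fields added to tokenizer_config.json control how inputs are padded and truncated to a 512-token maximum, with both `padding_side` and `truncation_side` set to `"right"`. A miniature illustration of what those two settings mean, using toy token ids and a hypothetical pad id of 1 (not the tokenizer's real behavior, which also handles special tokens and batching):

```python
def pad_and_truncate(ids, max_length=8, pad_id=1,
                     padding_side="right", truncation_side="right"):
    # Sequences longer than max_length are cut on truncation_side;
    # shorter ones are filled with pad_id on padding_side.
    if len(ids) > max_length:
        ids = ids[:max_length] if truncation_side == "right" else ids[-max_length:]
    padding = [pad_id] * (max_length - len(ids))
    return ids + padding if padding_side == "right" else padding + ids

print(pad_and_truncate([5, 6, 7]))        # [5, 6, 7, 1, 1, 1, 1, 1]
print(pad_and_truncate(list(range(12))))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With the committed config, the same logic applies at `max_length` 512: short ad/non-ad snippets are right-padded, and overlong ones lose tokens from the right end.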