Kaleemullah committed on
Commit 8651128
1 Parent(s): 2b51a3c

Add SetFit model
README.md CHANGED
@@ -5,65 +5,45 @@ tags:
  - sentence-transformers
  - text-classification
  pipeline_tag: text-classification
- language:
- - en
- metrics:
- - accuracy
  ---

  # Kaleemullah/paraphrase-mpnet-base-v2-ads-nonads-classifier

- ## Introduction
- "paraphrase-mpnet-base-v2-ads-nonads-classifier" is a SetFit model for text classification that distinguishes ad-related content from non-ad content. It is trained with an efficient few-shot learning technique, which makes it useful for content analysis and digital marketing applications.

- ## Model Description
- The model consists of a Sentence Transformer fine-tuned with contrastive learning and a classification head trained on the fine-tuned embeddings. It classifies texts as ad-related or non-ad and is designed to capture the subtle differences in language used in advertising content.

- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/61b2c130ac5ecaae3d1efe27/zS37gyEQW7MhP9CwWb7zZ.png)

- ## Training Data
- The model was trained on ad-related and non-ad examples sourced from multiple digital marketing platforms, covering a range of linguistic styles and terminology.
-
- ## Performance Metrics
- Performance on the ad/non-ad classification task is evaluated with accuracy, precision, recall, and F1 score.
-
- ## Installation and Usage
-
- To install the SetFit library, use one of the following commands:

  ```bash
- python3 -m pip install setfit
- # or
- pip install setfit
  ```

- You can then run inference as follows:

  ```python
  from setfit import SetFitModel

- # Initialize and download the model from the Hugging Face Hub
  model = SetFitModel.from_pretrained("Kaleemullah/paraphrase-mpnet-base-v2-ads-nonads-classifier")
-
- # Example texts for classification
- texts = ["xiaomi phone best with best camera quality", "Pineapple on pizza is the worst 🤮"]
-
  # Run inference
- # This classifies each text in 'texts' as ad-related or non-ad content
- predictions = model(texts)
-
- # 'predictions' is a list of labels, one per input text
- for text, pred in zip(texts, predictions):
-     print(f"Text: {text}\nPredicted category: {'Ad' if pred else 'Non-Ad'}\n")
  ```
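
If class probabilities are needed rather than hard labels, `SetFitModel` also exposes a `predict_proba` method. A minimal sketch, assuming a recent `setfit` release and reusing the example texts from above:

```python
from setfit import SetFitModel

model = SetFitModel.from_pretrained("Kaleemullah/paraphrase-mpnet-base-v2-ads-nonads-classifier")

texts = ["xiaomi phone best with best camera quality", "Pineapple on pizza is the worst 🤮"]

# predict_proba returns one row of class probabilities per input text
probs = model.predict_proba(texts)
for text, row in zip(texts, probs):
    print(f"{text} -> {row}")
```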

- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/61b2c130ac5ecaae3d1efe27/63Flu2lnmR6XS6E2hm9MY.png)
-
- ## Other Non-Ad Content Examples
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/61b2c130ac5ecaae3d1efe27/rZzbGFWXb4tYRhcO46LB7.png)

  - sentence-transformers
  - text-classification
  pipeline_tag: text-classification
  ---

  # Kaleemullah/paraphrase-mpnet-base-v2-ads-nonads-classifier

+ This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for text classification. The model has been trained using an efficient few-shot learning technique that involves:

+ 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
+ 2. Training a classification head with features from the fine-tuned Sentence Transformer.

+ ## Usage

+ To use this model for inference, first install the SetFit library:

  ```bash
+ python -m pip install setfit
  ```

+ You can then run inference as follows:

  ```python
  from setfit import SetFitModel

+ # Download from Hub and run inference
  model = SetFitModel.from_pretrained("Kaleemullah/paraphrase-mpnet-base-v2-ads-nonads-classifier")

  # Run inference
+ preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
  ```
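
The two steps described above (contrastive fine-tuning of the Sentence Transformer, then fitting a classification head) are handled end-to-end by the `setfit` library. The following is a minimal training sketch of that recipe using the `SetFitTrainer` API from earlier `setfit` releases; it is not the exact script used for this model, and the toy dataset, column names, and hyperparameters are illustrative assumptions:

```python
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Toy few-shot dataset (illustrative only): label 1 = ad, label 0 = non-ad
train_ds = Dataset.from_dict({
    "text": [
        "Buy now and get 50% off all smartphones!",
        "Limited-time offer: free shipping on orders over $20.",
        "I went for a long walk in the park this morning.",
        "The weather has been unusually cold this week.",
    ],
    "label": [1, 1, 0, 0],
})

# Start from the pretrained Sentence Transformer checkpoint
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,  # contrastive loss for fine-tuning the embedder
    batch_size=16,
    num_iterations=20,  # sentence pairs generated per example for contrastive learning
    num_epochs=1,       # epochs over the generated pairs
)
trainer.train()  # fine-tunes the embedder, then fits the classification head

preds = trainer.model(["Flash sale: 70% off sneakers today only!"])
```

With only a handful of labeled examples per class, this recipe is typically enough to produce a usable classifier, which is the point of the few-shot approach described above.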

+ ## BibTeX entry and citation info
+
+ ```bibtex
+ @article{https://doi.org/10.48550/arxiv.2209.11055,
+   doi = {10.48550/ARXIV.2209.11055},
+   url = {https://arxiv.org/abs/2209.11055},
+   author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
+   keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
+   title = {Efficient Few-Shot Learning Without Prompts},
+   publisher = {arXiv},
+   year = {2022},
+   copyright = {Creative Commons Attribution 4.0 International}
+ }
+ ```
config.json CHANGED
@@ -1,5 +1,5 @@
  {
- "_name_or_path": "/root/.cache/torch/sentence_transformers/sentence-transformers_paraphrase-mpnet-base-v2/",
+ "_name_or_path": "/root/.cache/torch/sentence_transformers/Kaleemullah_paraphrase-mpnet-base-v2-ads-nonads-classifier/",
  "architectures": [
  "MPNetModel"
  ],
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:b3f1f75a5936d708c2e06911e77e15b1908bf8e75650ce8b3bb36c4c0cf53810
+ oid sha256:0faec4a3c5dd1650ceeb4068d6b6c862833cd0bc26904c900ca9bc057e7d5fbb
  size 437967672
model_head.pkl CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:6b8adb7b03dd706b76fe6f8fc5776b9b1e00e85954abe4d0a75bc61989efd030
+ oid sha256:ce7c32d373f1de1eb51580dadc989970716a38e1e7c975fb683af0f456aaafd0
  size 7706
special_tokens_map.json CHANGED
@@ -9,7 +9,7 @@
  "cls_token": {
  "content": "<s>",
  "lstrip": false,
- "normalized": true,
+ "normalized": false,
  "rstrip": false,
  "single_word": false
  },
@@ -37,7 +37,7 @@
  "sep_token": {
  "content": "</s>",
  "lstrip": false,
- "normalized": true,
+ "normalized": false,
  "rstrip": false,
  "single_word": false
  },
tokenizer_config.json CHANGED
@@ -48,12 +48,19 @@
  "do_lower_case": true,
  "eos_token": "</s>",
  "mask_token": "<mask>",
+ "max_length": 512,
  "model_max_length": 512,
  "never_split": null,
+ "pad_to_multiple_of": null,
  "pad_token": "<pad>",
+ "pad_token_type_id": 0,
+ "padding_side": "right",
  "sep_token": "</s>",
+ "stride": 0,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "MPNetTokenizer",
+ "truncation_side": "right",
+ "truncation_strategy": "longest_first",
  "unk_token": "[UNK]"
  }