Kaleemullah committed
Commit 8651128
Parent(s): 2b51a3c

Add SetFit model

Files changed:
- README.md +23 -43
- config.json +1 -1
- model.safetensors +1 -1
- model_head.pkl +1 -1
- special_tokens_map.json +2 -2
- tokenizer_config.json +7 -0
README.md
CHANGED
@@ -5,65 +5,45 @@ tags:
 - sentence-transformers
 - text-classification
 pipeline_tag: text-classification
-language:
-- en
-metrics:
-- accuracy
 ---

 # Kaleemullah/paraphrase-mpnet-base-v2-ads-nonads-classifier

-The "paraphrase-mpnet-base-v2-ads-nonads-classifier" is a SetFit model tailored for efficient text classification, specifically designed to differentiate between ad-related and non-ad content. This model leverages advanced few-shot learning techniques, making it a powerful tool for applications in content analysis and digital marketing.
-
-The model was trained on a comprehensive dataset consisting of various examples of ad-related and non-ad content sourced from multiple digital marketing platforms. The dataset includes a diverse range of linguistic styles and terminologies, ensuring robustness and versatility in the model's classification ability.
-
-## Performance Metrics
-The model performs excellently in text classification tasks, highlighted by its accuracy, precision, recall, and F1 score metrics. These metrics underscore the model's effectiveness in correctly identifying and classifying ad and non-ad content.
-
-## Installation and Usage
-
-To install the SetFit library, use one of the following commands:

 ```bash
-# or
-pip install setfit
 ```

-You can then run the inference as follows:

 ```python
 from setfit import SetFitModel

-#
 model = SetFitModel.from_pretrained("Kaleemullah/paraphrase-mpnet-base-v2-ads-nonads-classifier")
-
-# Example texts for classification
-texts = ["xiaomi phone best with best camera quality ", "Pineapple on pizza is the worst 🤮"]
-
 # Run inference
-predictions = model(texts)
-
-# 'predictions' will be a list of labels corresponding to each input text
-for text, pred in zip(texts, predictions):
-    print(f"Text: {text}\nPredicted Category: {'Ad' if pred else 'Non-Ad'}\n")
 ```
 - sentence-transformers
 - text-classification
 pipeline_tag: text-classification
 ---

 # Kaleemullah/paraphrase-mpnet-base-v2-ads-nonads-classifier

+This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for text classification. The model has been trained using an efficient few-shot learning technique that involves:
+
+1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
+2. Training a classification head with features from the fine-tuned Sentence Transformer.
+
+## Usage
+
+To use this model for inference, first install the SetFit library:

 ```bash
+python -m pip install setfit
 ```

+You can then run inference as follows:

 ```python
 from setfit import SetFitModel

+# Download from Hub and run inference
 model = SetFitModel.from_pretrained("Kaleemullah/paraphrase-mpnet-base-v2-ads-nonads-classifier")
 # Run inference
+preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
 ```

+## BibTeX entry and citation info
+
+```bibtex
+@article{https://doi.org/10.48550/arxiv.2209.11055,
+  doi = {10.48550/ARXIV.2209.11055},
+  url = {https://arxiv.org/abs/2209.11055},
+  author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
+  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
+  title = {Efficient Few-Shot Learning Without Prompts},
+  publisher = {arXiv},
+  year = {2022},
+  copyright = {Creative Commons Attribution 4.0 International}
+}
+```
config.json
CHANGED
@@ -1,5 +1,5 @@
 {
-  "_name_or_path": "/root/.cache/torch/sentence_transformers/
+  "_name_or_path": "/root/.cache/torch/sentence_transformers/Kaleemullah_paraphrase-mpnet-base-v2-ads-nonads-classifier/",
   "architectures": [
     "MPNetModel"
   ],
model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:0faec4a3c5dd1650ceeb4068d6b6c862833cd0bc26904c900ca9bc057e7d5fbb
 size 437967672
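The model.safetensors and model_head.pkl entries above are Git LFS pointer files: three "key value" lines giving the spec version, the sha256 object id, and the byte size of the stored blob. A minimal sketch that parses such a pointer (using the pointer text from this diff):

```python
# A Git LFS pointer file is plain text: one "key value" pair per line.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:0faec4a3c5dd1650ceeb4068d6b6c862833cd0bc26904c900ca9bc057e7d5fbb
size 437967672
"""

def parse_lfs_pointer(text):
    # Split each line on the first space into key and value.
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

ptr = parse_lfs_pointer(pointer_text)
print(ptr["oid"])        # sha256:0faec4a3...
print(int(ptr["size"]))  # 437967672
```

The 437967672-byte size (~438 MB) is the actual weight file that LFS fetches; only this small pointer lives in the git history.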
model_head.pkl
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:ce7c32d373f1de1eb51580dadc989970716a38e1e7c975fb683af0f456aaafd0
 size 7706
special_tokens_map.json
CHANGED
@@ -9,7 +9,7 @@
   "cls_token": {
     "content": "<s>",
     "lstrip": false,
-    "normalized":
+    "normalized": false,
     "rstrip": false,
     "single_word": false
   },
@@ -37,7 +37,7 @@
   "sep_token": {
     "content": "</s>",
     "lstrip": false,
-    "normalized":
+    "normalized": false,
     "rstrip": false,
     "single_word": false
   },
tokenizer_config.json
CHANGED
@@ -48,12 +48,19 @@
   "do_lower_case": true,
   "eos_token": "</s>",
   "mask_token": "<mask>",
+  "max_length": 512,
   "model_max_length": 512,
   "never_split": null,
+  "pad_to_multiple_of": null,
   "pad_token": "<pad>",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
   "sep_token": "</s>",
+  "stride": 0,
   "strip_accents": null,
   "tokenize_chinese_chars": true,
   "tokenizer_class": "MPNetTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
   "unk_token": "[UNK]"
 }
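The fields added to tokenizer_config.json control how inputs are padded and truncated to a 512-token maximum, with both `padding_side` and `truncation_side` set to `"right"`. A miniature illustration of what those two settings mean, using toy token ids and a hypothetical pad id of 1 (not the tokenizer's real behavior, which also handles special tokens and batching):

```python
def pad_and_truncate(ids, max_length=8, pad_id=1,
                     padding_side="right", truncation_side="right"):
    # Sequences longer than max_length are cut on truncation_side;
    # shorter ones are filled with pad_id on padding_side.
    if len(ids) > max_length:
        ids = ids[:max_length] if truncation_side == "right" else ids[-max_length:]
    padding = [pad_id] * (max_length - len(ids))
    return ids + padding if padding_side == "right" else padding + ids

print(pad_and_truncate([5, 6, 7]))        # [5, 6, 7, 1, 1, 1, 1, 1]
print(pad_and_truncate(list(range(12))))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With the committed config, the same logic applies at `max_length` 512: short ad/non-ad snippets are right-padded, and overlong ones lose tokens from the right end.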