---
pipeline_tag: sentence-similarity
tags:
- feature-extraction
- sentence-similarity
- setfit
- e5
license: mit
datasets:
- KnutJaegersberg/wikipedia_categories
- KnutJaegersberg/wikipedia_categories_labels
---

This English model, based on e5-large, predicts Wikipedia categories (roughly 37 labels). It was trained in a few-shot setting on concatenated article headlines from lower-level categories: each level 2 category is represented by 8 subcategories, each with its concatenated headlines.
Accuracy on the test split is 85%.
Note that this figure only indicates that training worked. Performance will differ in production settings, which is why this classifier is meant for corpus exploration.
Use the KnutJaegersberg/wikipedia_categories_labels dataset as the key to the predicted labels.



    from setfit import SetFitModel

    # Download from Hub
    model = SetFitModel.from_pretrained("KnutJaegersberg/wikipedia_categories_setfit")

    # Run inference
    preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
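
The predictions still have to be matched against the label key mentioned above. The following is a minimal sketch of that lookup, not the author's method: it assumes the model returns integer label ids and that the key can be read as rows with an id column and a name column. The column names `label` and `category`, the helper names, and the stand-in rows are all hypothetical, so check the actual layout of the wikipedia_categories_labels dataset before relying on it.

```python
from typing import Iterable, Mapping, Sequence


def build_label_key(
    rows: Iterable[Mapping[str, object]],
    id_column: str = "label",       # hypothetical column name
    name_column: str = "category",  # hypothetical column name
) -> dict[int, str]:
    """Build an id -> category-name lookup from label-key rows."""
    return {int(row[id_column]): str(row[name_column]) for row in rows}


def decode_predictions(preds: Sequence[int], key: Mapping[int, str]) -> list[str]:
    """Translate predicted label ids into names; unknown ids stay visible."""
    return [key.get(int(p), f"<unknown label {int(p)}>") for p in preds]


# Stand-in rows with made-up names; real rows would come from the
# KnutJaegersberg/wikipedia_categories_labels dataset.
example_rows = [
    {"label": 0, "category": "category_a"},
    {"label": 1, "category": "category_b"},
]

key = build_label_key(example_rows)
names = decode_predictions([1, 0, 5], key)
print(names)
```

With the real key loaded, something like `decode_predictions(preds.tolist(), key)` should work, assuming `preds` is a tensor or array of integer ids. Ids missing from the key are returned as a visible placeholder rather than raising an error, which suits exploratory use.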