---
license: mit
tags:
- nepali-nlp, nepali-news-classificiation, nlp, transformers
model-index:
- name: patrakar
  results: []
widget:
- text: "नेकपा (एमाले)का नेता गोकर्णराज विष्टले सहमति र सहकार्यबाटै संविधान बनाउने तथा जनताको जीवनस्तर उकास्ने काम गर्नु नै अबको मुख्य काम रहेको बताएका छन् ।"
  example_title: "Example 1"
- text: "राजनीतिक स्थिरता नहुँदा विकास निर्माणले गति लिन सकेन"
  example_title: "Example 2"
- text: "छाउगोठ भत्काइदिए फेरि बनाउने, बनाउन नपाए ओडार वा बारीका कान्लामा रात बिताउने र ज्यानकै जोखिम मोल्न तयार हुने प्रवृत्तिबाट थाहा हुन्छ– छाउपडी प्रथा हटाउनका लागि बनाइएका अहिलेसम्मका योजना, रणनीति उपयुक्त छैनन् र गरिएको लगानी खेर गइरहेको छ"
  example_title: "Example 3"
---

# patrakar/ पत्रकार (Nepali News Classifier)

Last updated: September 2022

A DistilBERT model trained on a Nepali-language news dataset with 9 newsgroup categories, reaching 95.475% accuracy on the test dataset.

## Model Details

patrakar is a pre-trained DistilBERT sequence classification transformer model that classifies Nepali-language news into 9 newsgroup categories:

- politics
- opinion
- bank
- entertainment
- economy
- health
- literature
- sports
- tourism

It was developed by Sahaj Raj Malla to be useful to the general public and so that others can explore it for commercial and scientific purposes. The model was trained starting from the [Sakonii/distilgpt2-nepali](https://huggingface.co/Sakonii/distilgpt2-nepali) model.

It achieves the following results on the test dataset:

| Total number of samples | Accuracy (%) |
|:-----------------------:|:------------:|
| 5670 | 95.475 |
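
Accuracy is conventionally the share of test samples whose predicted category matches the true one. The sketch below illustrates that calculation under this assumption; the toy labels are made up and are not drawn from the test set, and `accuracy_percent` is a hypothetical helper name, not part of this model's code.

```python
def accuracy_percent(y_true, y_pred):
    """Return the percentage of positions where prediction equals truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    # Count matching (true, predicted) pairs.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


# Illustrative labels only: three of the four predictions match.
y_true = ["politics", "sports", "health", "bank"]
y_pred = ["politics", "sports", "economy", "bank"]
print(accuracy_percent(y_true, y_pred))
```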

### Model date

September 2022

### Model type

Sequence classification model

### Model version

1.0.0

## Model Usage

This model can be used directly with a pipeline for text classification:

```python
from transformers import pipeline

model_name = "sahajrajmalla/patrakar"

# Build a text-classification pipeline; the model is downloaded on first use.
classifier = pipeline("text-classification", model=model_name)

text = "नेकपा (एमाले)का नेता गोकर्णराज विष्टले सहमति र सहकार्यबाटै संविधान बनाउने तथा जनताको जीवनस्तर उकास्ने काम गर्नु नै अबको मुख्य काम रहेको बताएका छन् ।"

# Returns a list with one {"label": ..., "score": ...} dictionary per input.
print(classifier(text))
```
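
The pipeline returns one dictionary per input, holding a `label` and a `score`. Depending on how the model configuration was saved, `label` may be a category name or a generic `LABEL_<index>` placeholder, which is the Transformers default when no label names are stored. The sketch below handles both cases. It is an illustration, not part of the model's own code: `to_newsgroup` is a hypothetical helper, the sample result is made up, and the index-to-category order is assumed to match the label list used in the PyTorch example in this card.

```python
# Label order assumed to match the PyTorch example in this card.
LABELS = ["bank", "economy", "entertainment", "health", "literature",
          "opinion", "politics", "sports", "tourism"]


def to_newsgroup(result, labels=LABELS):
    """Return (newsgroup, score) for one pipeline result dictionary.

    Hypothetical helper: accepts either a category name or a generic
    "LABEL_<index>" placeholder in result["label"].
    """
    label = result["label"]
    if label.startswith("LABEL_"):
        # Generic placeholder: map the trailing index onto the label list.
        label = labels[int(label.split("_", 1)[1])]
    return label, result["score"]


# Made-up result in the shape the text-classification pipeline returns.
sample = [{"label": "LABEL_6", "score": 0.98}]
print(to_newsgroup(sample[0]))
```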

Here is how to use the model in PyTorch to predict the newsgroup of a given text:

```python
# Install the dependencies first: pip install transformers torch

import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "sahajrajmalla/patrakar"

# Download the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)


def tokenize_function(examples):
    # Tokenizes a batch whose text is stored under the "data" key
    # (not needed for the single prediction below)
    return tokenizer(examples["data"], padding="max_length", truncation=True)


# Text to classify
word_i_want_to_predict = "राजनीतिक स्थिरता नहुँदा विकास निर्माणले गति लिन सकेन"

# Labels, in the order of the model's output indices
label_list = [
    "bank",
    "economy",
    "entertainment",
    "health",
    "literature",
    "opinion",
    "politics",
    "sports",
    "tourism",
]

batch = tokenizer(word_i_want_to_predict, padding=True, truncation=True, max_length=512, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**batch)
    predictions = F.softmax(outputs.logits, dim=1)  # logits -> probabilities
    labels = torch.argmax(predictions, dim=1)  # index of the most likely class

print(f"The sequence: \n\n {word_i_want_to_predict} \n\n is predicted to be of newsgroup {label_list[labels.item()]}")
```
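
The final steps of the example above (softmax over the logits, then argmax) can be sketched in plain Python to show what the tensor operations compute. This is an illustration only: the logit values are made up and do not come from the model, and `softmax` and `predict_label` are hypothetical helper names.

```python
import math

# Same label order as in the PyTorch example above.
LABELS = ["bank", "economy", "entertainment", "health", "literature",
          "opinion", "politics", "sports", "tourism"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, labels=LABELS):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits for one input, one value per class.
fake_logits = [0.1, 0.3, -1.2, 0.0, -0.5, 1.1, 4.2, -0.8, -2.0]
label, prob = predict_label(fake_logits)
print(f"{label}: {prob:.3f}")
```

Because softmax preserves the ordering of the scores, taking the argmax of the raw logits yields the same label; the softmax step matters only when the probability itself is needed.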

## Training data

This model was trained on 50,945 rows of a Nepali-language news [dataset](https://www.kaggle.com/competitions/text-it-meet-22/data?select=train.csv) grouped by category, which is available on Kaggle and was also used in the IT Meet 2022 Text challenge.

## Framework versions

- Transformers 4.20.1
- PyTorch 1.9.1
- Datasets 2.0.0
- Tokenizers 0.11.6