sahajrajmalla committed on
Commit
dcd41c6
1 Parent(s): f3733f7

Update README.md

Files changed (1)
  1. README.md +131 -0
README.md CHANGED
@@ -1,3 +1,134 @@
  ---
  license: mit
+ tags:
+ - nepali-nlp
+ - nepali-news-classification
+ - nlp
+ - transformers
+ model-index:
+ - name: patrakar
+   results: []
+ widget:
+
+ - text: "नेकपा (एमाले)का नेता गोकर्णराज विष्टले सहमति र सहकार्यबाटै संविधान बनाउने तथा जनताको जीवनस्तर उकास्ने काम गर्नु नै अबको मुख्य काम रहेको बताएका छन् ।"
+   example_title: "Example 1"
+ - text: "राजनीतिक स्थिरता नहुँदा विकास निर्माणले गति लिन सकेन"
+   example_title: "Example 2"
+ - text: "छाउगोठ भत्काइदिए फेरि बनाउने, बनाउन नपाए ओडार वा बारीका कान्लामा रात बिताउने र ज्यानकै जोखिम मोल्न तयार हुने प्रवृत्तिबाट थाहा हुन्छ– छाउपडी प्रथा हटाउनका लागि बनाइएका अहिलेसम्मका योजना, रणनीति उपयुक्त छैनन् र गरिएको लगानी खेर गइरहेको छ"
+   example_title: "Example 3"
+
  ---
+
+ # patrakar / पत्रकार (Nepali News Classifier)
+
+ Last updated: September 2022
+
+ A DistilBERT model fine-tuned on a Nepali-language news dataset with 9 newsgroup categories, reaching 95.475% accuracy on the test dataset.
+
+ ## Model Details
+
+ patrakar is a DistilBERT-based sequence classification transformer model that classifies Nepali-language news into the following 9 newsgroup categories:
+
+ - politics
+ - opinion
+ - bank
+ - entertainment
+ - economy
+ - health
+ - literature
+ - sports
+ - tourism
+
+ It was developed by Sahaj Raj Malla to be useful to the general public and so that others can explore it for commercial and scientific purposes. The model was fine-tuned from the [Sakonii/distilgpt2-nepali](https://huggingface.co/Sakonii/distilgpt2-nepali) model.
+
+ It achieves the following results on the test dataset:
+
+ | Total number of samples | Accuracy (%) |
+ |:-----------------------:|:------------:|
+ | 5670                    | 95.475       |
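
For reference, accuracy here means the share of test samples whose predicted newsgroup matches the true one, expressed as a percentage. A minimal sketch with made-up labels follows; the `accuracy_percent` helper and the sample data are illustrative only and are not the evaluation code used for this model.

```python
def accuracy_percent(predicted, expected):
    """Return the percentage of positions where predicted equals expected."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return 100.0 * correct / len(expected)


# Made-up example: 3 of 4 predictions match the true labels.
predicted = ["politics", "sports", "bank", "health"]
expected = ["politics", "sports", "economy", "health"]
print(accuracy_percent(predicted, expected))  # 75.0
```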
+
+ ### Model date
+ September 2022
+
+ ### Model type
+ Sequence classification model
+
+ ### Model version
+ 1.0.0
+
+ ## Model Usage
+ This model can be used directly with a pipeline for text classification:
+
+ ```python
+ from transformers import pipeline
+
+ # Hugging Face Hub ID of the model
+ model_name = "sahajrajmalla/patrakar"
+
+ classifier = pipeline('text-classification', model=model_name)
+
+ text = "नेकपा (एमाले)का नेता गोकर्णराज विष्टले सहमति र सहकार्यबाटै संविधान बनाउने तथा जनताको जीवनस्तर उकास्ने काम गर्नु नै अबको मुख्य काम रहेको बताएका छन् ।"
+
+ # returns the predicted label and its score
+ classifier(text)
+ ```
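
A `text-classification` pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. If the model configuration does not define readable label names, Transformers falls back to generic names such as `LABEL_6`. The sketch below assumes that fallback, and assumes the class indices follow the order of the `label_list` given in the PyTorch example in this README. `to_newsgroup` and the sample output are hypothetical, so check the labels the model actually emits before relying on this mapping.

```python
# Label order copied from the PyTorch example in this README
# (assumed to match the model's class indices).
label_list = [
    "bank",
    "economy",
    "entertainment",
    "health",
    "literature",
    "opinion",
    "politics",
    "sports",
    "tourism",
]


def to_newsgroup(prediction, labels=label_list):
    """Map a pipeline result such as {'label': 'LABEL_6', ...} to a newsgroup name."""
    label = prediction["label"]
    prefix = "LABEL_"
    if label.startswith(prefix) and label[len(prefix):].isdigit():
        return labels[int(label[len(prefix):])]
    # The label is already a readable name; return it unchanged.
    return label


# Hypothetical pipeline output for one input text (values are made up).
sample_output = [{"label": "LABEL_6", "score": 0.99}]
print([to_newsgroup(p) for p in sample_output])  # ['politics']
```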
+
+ Here is how to use the model directly in PyTorch to predict the newsgroup of a given text:
+ ```python
+ # In a notebook, install the dependencies first:
+ # !pip install transformers torch
+
+ from transformers import AutoTokenizer
+ from transformers import AutoModelForSequenceClassification
+
+ import torch
+ import torch.nn.functional as F
+
+ # initializing model and tokenizer
+ model_name = "sahajrajmalla/patrakar"
+
+ # downloading tokenizer
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+ # downloading model
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # helper for tokenizing dataset examples with a "data" text column
+ # (not needed for the single prediction below)
+ def tokenize_function(examples):
+     return tokenizer(examples["data"], padding="max_length", truncation=True)
+
+ # predicting with the model
+ word_i_want_to_predict = "राजनीतिक स्थिरता नहुँदा विकास निर्माणले गति लिन सकेन"
+
+ # initializing our labels (index i is the name of class i)
+ label_list = [
+     "bank",
+     "economy",
+     "entertainment",
+     "health",
+     "literature",
+     "opinion",
+     "politics",
+     "sports",
+     "tourism"
+ ]
+
+ # tokenizing the text into PyTorch tensors
+ batch = tokenizer(word_i_want_to_predict, padding=True, truncation=True, max_length=512, return_tensors='pt')
+
+ # running inference without tracking gradients
+ with torch.no_grad():
+     outputs = model(**batch)
+     predictions = F.softmax(outputs.logits, dim=1)
+     labels = torch.argmax(predictions, dim=1)
+
+ print(f"The sequence: \n\n {word_i_want_to_predict} \n\n is predicted to be of newsgroup {label_list[labels.item()]}")
+ ```
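
The softmax step above turns the model's raw logits into probabilities that sum to 1, and `argmax` picks the most likely class. To see more than the single best guess, the probabilities can be ranked. The following is a dependency-free sketch with made-up logits; the `softmax` and `top_k` helpers are illustrative and are not part of Transformers or PyTorch.

```python
import math

# Same label order as the PyTorch example above.
label_list = [
    "bank",
    "economy",
    "entertainment",
    "health",
    "literature",
    "opinion",
    "politics",
    "sports",
    "tourism",
]


def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, labels, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up logits for one text, in the same order as label_list.
logits = [0.1, 1.2, -0.5, 0.0, -1.0, 2.0, 4.5, -0.3, 0.2]
for label, prob in top_k(logits, label_list):
    print(f"{label}: {prob:.3f}")
```

Ranking the top few classes can help spot borderline texts, for example an article that sits between politics and opinion.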
+
+ ## Training data
+ This model was trained on 50,945 rows of a Nepali-language news [dataset](https://www.kaggle.com/competitions/text-it-meet-22/data?select=train.csv) grouped by category. The dataset is available on Kaggle and was also used in the IT Meet 2022 Text challenge.
+
+ ## Framework versions
+ - Transformers 4.20.1
+ - PyTorch 1.9.1
+ - Datasets 2.0.0
+ - Tokenizers 0.11.6
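
To compare a local environment with the versions listed above, the installed versions can be read with the standard library. This is a small sketch: the PyPI distribution names (`transformers`, `torch`, `datasets`, `tokenizers`) are assumed to correspond to the entries in the list, and `installed_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this README, keyed by assumed PyPI distribution name.
listed = {
    "transformers": "4.20.1",
    "torch": "1.9.1",
    "datasets": "2.0.0",
    "tokenizers": "0.11.6",
}


def installed_versions(packages):
    """Return {package: installed version, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, have in installed_versions(listed).items():
    status = "matches" if have == listed[name] else "differs"
    print(f"{name}: listed {listed[name]}, installed {have} ({status})")
```

A differing version does not necessarily mean the model will fail to load; the list records the environment the model was built with.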