Text Classification
Transformers
Safetensors
Sinhala
roberta
Inference Endpoints
tharindu committed on
Commit
7b277c4
1 Parent(s): b45a997

Create README.md

  1. README.md +39 -0
README.md ADDED
@@ -0,0 +1,39 @@
+ ---
+ license: cc-by-sa-4.0
+ datasets:
+ - sinhala-nlp/NSINA
+ - sinhala-nlp/NSINA-Categories
+ language:
+ - si
+ ---
+
+ # Sinhala News Category Prediction
+ This is a text classification task built on the [NSINA dataset](https://github.com/Sinhala-NLP/NSINA). The dataset is released under the same license as NSINA. Given the content of a news article, a model should predict its category from a predefined set.
+
+ ## Data
+ First, we dropped all news articles in NSINA 1.0 that have no category, since some news sources prefer not to categorise their articles. Next, we identified the top 100 news categories among the remaining instances and grouped them, where possible, into four main categories: local news, international news, sports news, and business news. To avoid bias, we undersampled the dataset: we used only 10,000 instances from each category, except for the "Business" category, where we kept all 8,777 available articles. We divided this dataset into training and test sets following a 0.8 split.
+ The data can be loaded into pandas DataFrames using the following code.
+
+ ```python
+ from datasets import load_dataset
+
+ # Load each split from the Hugging Face Hub and convert it to a pandas DataFrame.
+ train = load_dataset('sinhala-nlp/NSINA-Categories', split='train').to_pandas()
+ test = load_dataset('sinhala-nlp/NSINA-Categories', split='test').to_pandas()
+ ```
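
The undersampling and 0.8 split described in the Data section can be illustrated with a minimal plain-Python sketch. It is not the authors' preprocessing code. The `category` field name, the toy category sizes, the random seed, and the unstratified shuffle-and-cut split are all assumptions made for illustration, and the cap is scaled down from 10,000 to 10 so the example runs instantly without downloading anything.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, cap, seed=0):
    """Keep at most `cap` randomly chosen rows per category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for row in rows:
        # "category" is a hypothetical field name, not the dataset's real column.
        by_category[row["category"]].append(row)
    sampled = []
    for category in sorted(by_category):
        items = by_category[category]
        # Categories at or below the cap are kept whole, as "Business" was.
        sampled.extend(items if len(items) <= cap else rng.sample(items, cap))
    return sampled


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it at `train_fraction` (not stratified)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy stand-in for the categorised articles; the sizes are invented.
toy_sizes = {"local": 50, "international": 30, "sports": 12, "business": 8}
toy_rows = [
    {"category": name, "text": f"{name} article {i}"}
    for name, size in toy_sizes.items()
    for i in range(size)
]

balanced = undersample(toy_rows, cap=10)
train_rows, test_rows = train_test_split(balanced, train_fraction=0.8)

print(Counter(row["category"] for row in balanced))
print(len(train_rows), len(test_rows))
```

At the scale stated in the Data section, the same capping rule gives three categories of 10,000 articles plus 8,777 business articles, or 38,777 articles in total before the split.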
+
+ ## Citation
+ If you use the dataset or the models, please cite the following paper.
+
+ ~~~
+ @inproceedings{Nsina2024,
+ author={Hettiarachchi, Hansi and Premasiri, Damith and Uyangodage, Lasitha and Ranasinghe, Tharindu},
+ title={{NSINA: A News Corpus for Sinhala}},
+ booktitle={The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
+ year={2024},
+ month={May},
+ }
+ ~~~