Commit c2ba415 by kenhktsui · verified · 1 parent: ec97cd8

Update README.md

  1. README.md +92 -3
README.md CHANGED
@@ -1,3 +1,92 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ datasets:
4
+ - kenhktsui/code-natural-language-classification-dataset
5
+ language:
6
+ - en
7
+ metrics:
8
+ - f1
9
+ pipeline_tag: text-classification
10
+ library_name: fasttext
11
+ ---
12
+ # code-natural-language-fasttext-classifier
13
+
14
+ [Dataset](https://huggingface.co/datasets/kenhktsui/code-natural-language-classification-dataset)
15
+
16
+ This classifier labels a text as either Code or NaturalLanguage.
17
+ The model was trained on 3.24M records, a mix of code and natural language, and achieved a test F1 score of 0.99.
18
+ The classifier can be used for LLM pretraining data curation, for example to route a text into different pipelines (e.g. a code syntax check).
19
+ It is ultra fast ⚡, with a throughput of ~2000 docs/s on CPU.
20
+
21
+
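To check the throughput figure on your own hardware, a minimal timing sketch follows. The helper name `measure_throughput` and its `repeats` parameter are illustrative assumptions, not part of this repository; `predict_fn` stands for any batch function, such as a `predict` wrapper around the fastText model.

```python
import time
from typing import Callable, List, Sequence


def measure_throughput(
    predict_fn: Callable[[List[str]], Sequence],
    docs: List[str],
    repeats: int = 3,
) -> float:
    """Return the best observed documents-per-second over `repeats` runs."""
    if not docs:
        raise ValueError("docs must not be empty")
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(docs)
        elapsed = time.perf_counter() - start
        # Guard against a zero reading from a very fast call on a coarse clock.
        best = max(best, len(docs) / max(elapsed, 1e-9))
    return best
```

The result depends on document length and CPU, so it is best measured on a sample of your own data.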
22
+ ## 🛠️Usage
23
+ ```python
24
+ from typing import List
25
+ import re
26
+ from huggingface_hub import hf_hub_download
27
+ import fasttext
28
+
29
+
30
+ model_hf = fasttext.load_model(hf_hub_download("kenhktsui/code-natural-language-fasttext-classifier", "model.bin"))
31
+
32
+
33
+ def replace_newlines(text: str) -> str:
34
+     return re.sub("\n+", " ", text)
35
+
36
+
37
+ def predict(text_list: List[str]) -> List[dict]:
38
+     text_list = [replace_newlines(text) for text in text_list]
39
+     pred = model_hf.predict(text_list)
40
+     return [{"label": l[0].replace("__label__", "", 1), "score": s[0]}
41
+             for l, s in zip(*pred)]
42
+
43
+
44
+ predict([
45
+ """This is a lightning fast model, which can classify at throughtput of 2000 doc/s with CPU""",
46
+ """import torch""",
47
+ """Short text won't work"""
48
+ ])
49
+ # [{'label': 'NaturalLanguage', 'score': 0.96747404},
50
+ # {'label': 'Code', 'score': 1.00001},
51
+ # {'label': 'Code', 'score': 1.000009}]
52
+ ```
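The routing use case mentioned in the introduction could be sketched as follows. This is only an illustration under stated assumptions: `route`, `is_valid_python` and the `MIN_SCORE` cut-off are hypothetical names and values, and `classify` is expected to behave like the `predict` function above (a list of texts in, a list of `{"label", "score"}` dicts out).

```python
import ast
from typing import Callable, Dict, List

# Hypothetical cut-off: predictions scoring below it are set aside for review.
MIN_SCORE = 0.9


def is_valid_python(source: str) -> bool:
    """Example syntax check for one language; other languages need their own parser."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def route(
    texts: List[str],
    classify: Callable[[List[str]], List[dict]],
) -> Dict[str, List[str]]:
    """Split texts into buckets using the labels returned by `classify`."""
    buckets: Dict[str, List[str]] = {
        "code": [],
        "natural_language": [],
        "uncertain": [],
    }
    for text, pred in zip(texts, classify(texts)):
        if pred["score"] < MIN_SCORE:
            buckets["uncertain"].append(text)
        elif pred["label"] == "Code":
            buckets["code"].append(text)
        else:
            buckets["natural_language"].append(text)
    return buckets
```

With the model loaded, `route(texts, predict)` would fill the buckets, and the `code` bucket could then be passed through a check such as `is_valid_python`.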
53
+
54
+
55
+ ## 📝Definition of Label
56
+ Code covers:
57
+ ```
58
+ {'Assembly',
59
+ 'Batchfile',
60
+ 'C',
61
+ 'C#',
62
+ 'C++',
63
+ 'CMake',
64
+ 'CSS',
65
+ 'Dockerfile',
66
+ 'FORTRAN',
67
+ 'GO',
68
+ 'HTML',
69
+ 'Haskell',
70
+ 'Java',
71
+ 'JavaScript',
72
+ 'Julia',
73
+ 'Lua',
74
+ 'Makefile',
75
+ 'PHP',
76
+ 'Perl',
77
+ 'PowerShell',
78
+ 'Python',
79
+ 'Ruby',
80
+ 'Rust',
81
+ 'SQL',
82
+ 'Scala',
83
+ 'Shell',
84
+ 'TeX',
85
+ 'TypeScript',
86
+ 'Visual Basic'}
87
+ ```
88
+ NaturalLanguage also covers Markdown, which overlaps heavily with natural language.
89
+
90
+ ## ⚠️Known Limitation
91
+ The classifier does not handle short text well, which might not be surprising.
92
+ It tends to classify short natural-language text as Code, possibly because similar short text appears in code comments.
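One way to work around the short-text limitation is to skip classification below a length threshold. The sketch below is an assumption, not part of the model: `predict_with_length_guard`, the `"Unknown"` label and the 50-character threshold are all hypothetical, and `classify` again stands for a function shaped like `predict` in the usage example.

```python
from typing import Callable, List

# Hypothetical threshold; tune it on your own data.
MIN_CHARS = 50


def predict_with_length_guard(
    texts: List[str],
    classify: Callable[[List[str]], List[dict]],
    min_chars: int = MIN_CHARS,
) -> List[dict]:
    """Classify only texts of at least `min_chars` characters; mark the rest 'Unknown'."""
    long_idx = [i for i, t in enumerate(texts) if len(t.strip()) >= min_chars]
    # Call the classifier once, and only on the texts that are long enough.
    preds = classify([texts[i] for i in long_idx]) if long_idx else []
    results: List[dict] = [{"label": "Unknown", "score": None} for _ in texts]
    for i, pred in zip(long_idx, preds):
        results[i] = pred
    return results
```

Texts marked `Unknown` can then be handled by a fallback rule or reviewed separately.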