kenhktsui
/

code-natural-language-fasttext-classifier

Text Classification

kenhktsui committed on Oct 30, 2024

Commit

c2ba415

·

verified ·

1 Parent(s): ec97cd8

Update README.md

Files changed (1)

README.md +92 -3

README.md CHANGED

@@ -1,3 +1,92 @@
----
-license: mit
----

+---
+license: mit
+datasets:
+- kenhktsui/code-natural-language-classification-dataset
+language:
+- en
+metrics:
+- f1
+pipeline_tag: text-classification
+library_name: fasttext
+---
+# code-natural-language-classification-dataset
+[Dataset](https://huggingface.co/datasets/kenhktsui/code-natural-language-classification-dataset)
+This classifier labels a text as either Code or NaturalLanguage.
+The model was trained on 3.24M records, a mix of code and natural language, and achieved a test F1 score of 0.99.
+The classifier can be used for LLM pretraining data curation, to route a text into different pipelines (e.g. a code syntax check).
+It is ultra fast ⚡, with a throughput of ~2000 doc/s on CPU.
+## 🛠️Usage
+```python
+from typing import List
+import re
+
+import fasttext
+from huggingface_hub import hf_hub_download
+
+# Download the model file from the Hub and load it with fastText.
+model = fasttext.load_model(hf_hub_download("kenhktsui/code-natural-language-fasttext-classifier", "model.bin"))
+
+def replace_newlines(text: str) -> str:
+  # fastText predicts one line at a time, so collapse newlines into spaces.
+  return re.sub("\n+", " ", text)
+
+def predict(text_list: List[str]) -> List[dict]:
+  text_list = [replace_newlines(text) for text in text_list]
+  pred = model.predict(text_list)
+  # Drop the "__label__" prefix; str.lstrip would strip a character set, not a prefix.
+  return [{"label": l[0].replace("__label__", "", 1), "score": s[0]}
+           for l, s in zip(*pred)]
+
+predict([
+  """This is a lightning fast model, which can classify at throughtput of 2000 doc/s with CPU""",
+  """import torch""",
+  """Short text won't work"""
+])
+# [{'label': 'NaturalLanguage', 'score': 0.96747404},
+# {'label': 'Code', 'score': 1.00001},
+# {'label': 'Code', 'score': 1.000009}]
+```
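The routing use case mentioned earlier could be sketched roughly as follows. This is an illustration only, not part of the model card: `route_by_label`, the `Uncertain` bucket, and the `min_score` threshold of 0.9 are hypothetical, and the predictions are hand-written in the same shape that `predict` returns, so the snippet runs without downloading the model.

```python
from typing import Dict, List


def route_by_label(
    texts: List[str],
    predictions: List[dict],
    min_score: float = 0.9,  # hypothetical threshold, tune on your own data
) -> Dict[str, List[str]]:
    """Group texts into buckets using the classifier's label and score."""
    buckets: Dict[str, List[str]] = {"Code": [], "NaturalLanguage": [], "Uncertain": []}
    for text, pred in zip(texts, predictions):
        # Low-confidence predictions go to a separate bucket for review.
        key = pred["label"] if pred["score"] >= min_score else "Uncertain"
        buckets.setdefault(key, []).append(text)
    return buckets


# Hand-written predictions in the same shape that predict() returns.
texts = ["import torch", "The weather is lovely today, so we went hiking.", "x"]
preds = [
    {"label": "Code", "score": 0.99},
    {"label": "NaturalLanguage", "score": 0.97},
    {"label": "Code", "score": 0.55},
]
buckets = route_by_label(texts, preds)
```

In a real pipeline, `preds` would come from `predict(texts)`, and each bucket would feed its own downstream step, such as a syntax check for the `Code` bucket.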
+## 📝Definition of Label
+Code covers:
+```
+{'Assembly',
+ 'Batchfile',
+ 'C',
+ 'C#',
+ 'C++',
+ 'CMake',
+ 'CSS',
+ 'Dockerfile',
+ 'FORTRAN',
+ 'GO',
+ 'HTML',
+ 'Haskell',
+ 'Java',
+ 'JavaScript',
+ 'Julia',
+ 'Lua',
+ 'Makefile',
+ 'PHP',
+ 'Perl',
+ 'PowerShell',
+ 'Python',
+ 'Ruby',
+ 'Rust',
+ 'SQL',
+ 'Scala',
+ 'Shell',
+ 'TeX',
+ 'TypeScript',
+ 'Visual Basic'}
+```
+NaturalLanguage includes Markdown, which overlaps heavily with natural language.
+## ⚠️Known Limitation
+The classifier does not handle short text well, which is perhaps unsurprising.
+It tends to classify short natural-language text as Code, perhaps because short natural-language snippets also appear in code as comments.
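One possible way to work around this limitation is to set short inputs aside before classification. The sketch below is an assumption rather than the author's method: `split_by_length` and the `min_words` cutoff of 10 are hypothetical, since the model card does not define what counts as "short".

```python
from typing import List, Tuple


def split_by_length(
    texts: List[str],
    min_words: int = 10,  # hypothetical cutoff; the model card does not define "short"
) -> Tuple[List[str], List[str]]:
    """Return (long_enough, too_short) based on whitespace-separated word count."""
    long_enough: List[str] = []
    too_short: List[str] = []
    for text in texts:
        if len(text.split()) >= min_words:
            long_enough.append(text)
        else:
            too_short.append(text)
    return long_enough, too_short


long_enough, too_short = split_by_length([
    "Short text won't work",
    "This is a lightning fast model, which can classify at a throughput of 2000 doc/s with CPU",
])
```

Only `long_enough` would then be passed to the classifier. Note that a word-count cutoff also sets aside short code such as `import torch`, so the threshold is worth validating on your own data.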