---
license: apache-2.0
datasets:
- Novora/CodeClassifier_v1
pipeline_tag: text-classification
---
# Introduction
Novora Code Classifier v1 Tiny is a tiny `Text Classification` model that classifies a given code text input into 1 of `31` classes (programming languages).
This model is designed to run on CPU, but runs fastest on a GPU.
# Info
- Outputs 1 of 31 classes (programming languages)
- 512-token input length
- 64 hidden dimensions
- 2 linear layers
- The `snowflake-arctic-embed-xs` model is used as the embeddings model.
- Dataset split into an 80% training set and a 20% testing set.
- The combined training and testing data comes to roughly 1,000 chunks per programming language, about 31,100 chunks (entries) in total; each chunk is a 512-token snippet of code (a sketch of this chunking appears after this list).
- The published checkpoint is from the 18th of 20 training epochs.
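The exact preprocessing code is not published with this card, but a minimal sketch of how source files could be split into 512-token chunks with the embedding model's tokenizer might look like this (the `chunk_code` helper is hypothetical):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Snowflake/snowflake-arctic-embed-xs")

def chunk_code(source: str, chunk_size: int = 512):
    # Tokenize the whole file, then slice the token stream into
    # fixed-size chunks and decode each chunk back to text.
    token_ids = tokenizer(source, add_special_tokens=False)["input_ids"]
    return [
        tokenizer.decode(token_ids[i:i + chunk_size])
        for i in range(0, len(token_ids), chunk_size)
    ]
```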
# Architecture
The `CodeClassifier-v1-Tiny` model employs a neural network architecture optimized for text classification tasks, specifically for classifying programming languages from code snippets. This model includes:
- **Bidirectional LSTM Feature Extractor**: This bidirectional LSTM layer processes input embeddings, effectively capturing contextual relationships in both forward and reverse directions within the code snippets.
- **Fully Connected Layers**: The network includes two linear layers. The first projects the pooled features into a hidden feature space, and the second linear layer maps these to the output classes, which correspond to different programming languages. A dropout layer with a rate of 0.5 between these layers helps mitigate overfitting.
The model's bidirectional LSTM and stacked linear layers make it well suited to picking up the syntactic patterns that distinguish programming languages; a shape walkthrough of this stack is sketched below.
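As a quick orientation, here is how tensor shapes flow through that stack, assuming a 384-dimensional output from `snowflake-arctic-embed-xs` (an assumption; check the embedding model's config for the exact value):
```python
import torch
import torch.nn as nn

embedding_dim, hidden_dim, num_classes = 384, 64, 31  # 384 is an assumption
x = torch.randn(1, embedding_dim)                     # one pooled code embedding
lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2,
               batch_first=True, bidirectional=True)
out, _ = lstm(x.unsqueeze(1))                         # (1, 1, 2 * hidden_dim) = (1, 1, 128)
hidden = nn.Linear(2 * hidden_dim, hidden_dim)(out.squeeze(1))  # (1, 64)
logits = nn.Linear(hidden_dim, num_classes)(hidden)             # (1, 31)
print(logits.shape)  # torch.Size([1, 31])
```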
# Testing/Training Datasets
The table below lists the samples entered into the training/testing pipeline; it is a very small amount of data per language. (Fortran and Objective-C are part of the model's label set but are not listed here.)
| Language | Testing Count | Training Count |
|--------------|---------------|----------------|
| Ada | 20 | 80 |
| Assembly | 20 | 80 |
| C | 20 | 80 |
| C# | 20 | 80 |
| C++ | 20 | 80 |
| COBOL | 14 | 55 |
| Common Lisp | 20 | 80 |
| Dart | 20 | 80 |
| Erlang | 20 | 80 |
| F# | 20 | 80 |
| Go | 20 | 80 |
| Haskell | 20 | 80 |
| Java | 20 | 80 |
| JavaScript | 20 | 80 |
| Julia | 20 | 80 |
| Kotlin | 20 | 80 |
| Lua | 20 | 80 |
| MATLAB | 20 | 80 |
| PHP | 20 | 80 |
| Perl | 20 | 80 |
| Prolog | 1 | 4 |
| Python | 20 | 80 |
| R | 20 | 80 |
| Ruby | 20 | 80 |
| Rust | 20 | 80 |
| SQL | 20 | 80 |
| Scala | 20 | 80 |
| Swift | 20 | 80 |
| TypeScript | 20 | 80 |
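The 80%/20% split itself is straightforward. A minimal sketch of how it could be reproduced, assuming a list of `(code_chunk, language_label)` pairs (the toy `chunks` list below is illustrative only; the actual split code is not published):
```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset: (code_chunk, language_label) pairs
chunks = [("print('hi')", "Python"), ("puts 'hi'", "Ruby")] * 50

train_set, test_set = train_test_split(
    chunks,
    test_size=0.2,                             # 20% testing set, 80% training set
    stratify=[label for _, label in chunks],   # keep per-language ratios
    random_state=42,
)
```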
# Example Code
```python
import torch
import torch.nn as nn
from pathlib import Path

from safetensors.torch import load_file
from transformers import AutoTokenizer, AutoModel


class CodeClassifier(nn.Module):
    def __init__(self, num_classes, embedding_dim, hidden_dim, num_layers, bidirectional=False):
        super().__init__()
        # Bidirectional LSTM feature extractor over the pooled embedding
        self.feature_extractor = nn.LSTM(embedding_dim, hidden_dim, num_layers,
                                         batch_first=True, bidirectional=bidirectional)
        self.dropout = nn.Dropout(0.5)  # Mitigates overfitting between the linear layers
        self.fc1 = nn.Linear(hidden_dim * (2 if bidirectional else 1), hidden_dim)  # Intermediate layer
        self.fc2 = nn.Linear(hidden_dim, num_classes)  # Output layer

    def forward(self, x):
        x = x.unsqueeze(1)  # Add a sequence dimension of length 1
        x, _ = self.feature_extractor(x)
        x = x.squeeze(1)  # Remove the sequence dimension
        x = self.fc1(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x


def infer(text, model_path, embedding_model_name):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load tokenizer and embedding model
    tokenizer = AutoTokenizer.from_pretrained(embedding_model_name)
    embedding_model = AutoModel.from_pretrained(embedding_model_name).to(device)
    embedding_model.eval()

    # Prepare inputs
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate embeddings: take the [CLS] token vector from the last hidden state
    with torch.no_grad():
        embeddings = embedding_model(**inputs)[0][:, 0]

    # Load the classifier weights (torch.load cannot read .safetensors files,
    # so the checkpoint is loaded with safetensors instead)
    model = CodeClassifier(num_classes=31, embedding_dim=embeddings.size(-1),
                           hidden_dim=64, num_layers=2, bidirectional=True)
    model.load_state_dict(load_file(str(model_path)))
    model = model.to(device)
    model.eval()

    # Predict class
    with torch.no_grad():
        output = model(embeddings)
        _, predicted = torch.max(output, dim=1)

    # Language labels; the index order must match the order used in training
    languages = [
        'Ada', 'Assembly', 'C', 'C#', 'C++', 'COBOL', 'Common Lisp', 'Dart', 'Erlang', 'F#',
        'Fortran', 'Go', 'Haskell', 'Java', 'JavaScript', 'Julia', 'Kotlin', 'Lua', 'MATLAB',
        'Objective-C', 'PHP', 'Perl', 'Prolog', 'Python', 'R', 'Ruby', 'Rust', 'SQL', 'Scala',
        'Swift', 'TypeScript'
    ]
    return languages[predicted.item()]


# Example usage
if __name__ == "__main__":
    example_text = "print('Hello, world!')"  # Replace with actual code for inference
    model_file_path = Path("./model.safetensors")
    predicted_language = infer(example_text, model_file_path, "Snowflake/snowflake-arctic-embed-xs")
    print(f"Predicted programming language: {predicted_language}")
```
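Running the example requires `torch`, `transformers`, and `safetensors` (for example, `pip install torch transformers safetensors`), with the checkpoint path and embedding model name adjusted to your local setup.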