---
license: apache-2.0
datasets:
    - Novora/CodeClassifier_v1 
pipeline_tag: text-classification 
---

# Introduction

Novora Code Classifier v1 Tiny is a tiny `Text Classification` model that classifies a given code text input into 1 of `31` classes (programming languages).

This model is designed to run on CPU, but runs best on a GPU.

# Info
- Outputs 1 of 31 classes (programming languages)
- 512 token input dimension
- 64 hidden dimensions
- 2 linear layers
- The `snowflake-arctic-embed-xs` model is used as the embeddings model.
- The dataset is split into an 80% training set and a 20% testing set.
- The combined training and testing data amounts to around 1,000 chunks per programming language, 31,100 chunks (entries) in total; each chunk is 512 tokens long and is a snippet of code (see the chunking sketch below).
- The released checkpoint was picked from the 18th epoch out of 20.
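
The exact chunking procedure used to build the dataset is not published here. As a rough sketch, splitting a source file into 512-token chunks with the `snowflake-arctic-embed-xs` tokenizer could look like this (the `chunk_code` helper and the `example.py` file are hypothetical):

```python
from transformers import AutoTokenizer

def chunk_code(text, tokenizer, chunk_size=512):
    """Split a source file into chunks of at most `chunk_size` tokens."""
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [
        tokenizer.decode(token_ids[i:i + chunk_size])
        for i in range(0, len(token_ids), chunk_size)
    ]

tokenizer = AutoTokenizer.from_pretrained("Snowflake/snowflake-arctic-embed-xs")
with open("example.py") as f:  # hypothetical source file
    chunks = chunk_code(f.read(), tokenizer)
print(f"{len(chunks)} chunk(s) of up to 512 tokens")
```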

# Architecture

The `CodeClassifier-v1-Tiny` model employs a neural network architecture optimized for text classification tasks, specifically for classifying programming languages from code snippets. This model includes:

- **Bidirectional LSTM Feature Extractor**: This bidirectional LSTM layer processes input embeddings, effectively capturing contextual relationships in both forward and reverse directions within the code snippets.

- **Fully Connected Layers**: The network includes two linear layers. The first projects the pooled features into a hidden feature space, and the second linear layer maps these to the output classes, which correspond to different programming languages. A dropout layer with a rate of 0.5 between these layers helps mitigate overfitting.

The bidirectional LSTM and the two fully connected layers make the model adept at capturing the syntax and structure that distinguish programming languages.
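
To make the data flow concrete, here is a minimal shape sketch of the layers described above, assuming the 384-dimensional embeddings of `snowflake-arctic-embed-xs`, 64 hidden dimensions, and 31 output classes (the full model class is given under Example Code below):

```python
import torch
import torch.nn as nn

# Shape sketch only; 384 is assumed to be the snowflake-arctic-embed-xs embedding size.
batch = torch.randn(4, 384)                               # 4 code-snippet embeddings
lstm = nn.LSTM(384, 64, num_layers=2, batch_first=True, bidirectional=True)
features, _ = lstm(batch.unsqueeze(1))                    # (4, 1, 128): forward + backward hidden states
features = features.squeeze(1)                            # (4, 128)
hidden = nn.Dropout(0.5)(nn.Linear(128, 64)(features))    # first linear layer + dropout -> (4, 64)
logits = nn.Linear(64, 31)(hidden)                        # second linear layer -> (4, 31), one logit per language
print(logits.shape)                                       # torch.Size([4, 31])
```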

# Testing/Training Datasets
The table below lists the samples entered into the training/testing pipeline per language; it is a very small amount of data (a sketch of the 80/20 split follows the table).

| Language     | Testing Count | Training Count |
|--------------|---------------|----------------|
| Ada          | 20            | 80             |
| Assembly     | 20            | 80             |
| C            | 20            | 80             |
| C#           | 20            | 80             |
| C++          | 20            | 80             |
| COBOL        | 14            | 55             |
| Common Lisp  | 20            | 80             |
| Dart         | 20            | 80             |
| Erlang       | 20            | 80             |
| F#           | 20            | 80             |
| Go           | 20            | 80             |
| Haskell      | 20            | 80             |
| Java         | 20            | 80             |
| JavaScript   | 20            | 80             |
| Julia        | 20            | 80             |
| Kotlin       | 20            | 80             |
| Lua          | 20            | 80             |
| MATLAB       | 20            | 80             |
| PHP          | 20            | 80             |
| Perl         | 20            | 80             |
| Prolog       | 1             | 4              |
| Python       | 20            | 80             |
| R            | 20            | 80             |
| Ruby         | 20            | 80             |
| Rust         | 20            | 80             |
| SQL          | 20            | 80             |
| Scala        | 20            | 80             |
| Swift        | 20            | 80             |
| TypeScript   | 20            | 80             |
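
The splitting code itself is not included in this card; the counts above simply reflect the 80/20 split mentioned earlier. A minimal sketch of a stratified 80/20 split, using toy stand-in data rather than the real dataset, could be:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins: each entry is a code chunk paired with its language label.
chunks = ["print('hi')", "puts 'hi'", 'fmt.Println("hi")', "echo 'hi';"] * 25
labels = ["Python", "Ruby", "Go", "PHP"] * 25

train_chunks, test_chunks, train_labels, test_labels = train_test_split(
    chunks, labels, test_size=0.2, stratify=labels, random_state=42
)
print(len(train_chunks), len(test_chunks))  # 80 20
```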

# Example Code

```python
import torch.nn as nn

class CodeClassifier(nn.Module):
    def __init__(self, num_classes, embedding_dim, hidden_dim, num_layers, bidirectional=False):
        super(CodeClassifier, self).__init__()
        self.feature_extractor = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True, bidirectional=bidirectional)
        self.dropout = nn.Dropout(0.5)  # Dropout between the two linear layers
        self.fc1 = nn.Linear(hidden_dim * (2 if bidirectional else 1), hidden_dim)  # Intermediate layer
        self.fc2 = nn.Linear(hidden_dim, num_classes)  # Output layer

    def forward(self, x):
        x = x.unsqueeze(1)  # Add sequence dimension
        x, _ = self.feature_extractor(x)
        x = x.squeeze(1)  # Remove sequence dimension
        x = self.fc1(x)
        x = self.dropout(x)  # Apply dropout
        x = self.fc2(x)
        return x

import torch
from pathlib import Path
from safetensors.torch import load_file
from transformers import AutoTokenizer, AutoModel

def infer(text, model_path, embedding_model_name):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Load tokenizer and embedding model
    tokenizer = AutoTokenizer.from_pretrained(embedding_model_name)
    embedding_model = AutoModel.from_pretrained(embedding_model_name).to(device)
    embedding_model.eval()

    # Prepare inputs
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Generate embeddings (use the [CLS]-token vector as the snippet embedding)
    with torch.no_grad():
        embeddings = embedding_model(**inputs)[0][:, 0]

    # Load classifier model (weights stored in safetensors format)
    model = CodeClassifier(num_classes=31, embedding_dim=embeddings.size(-1), hidden_dim=64, num_layers=2, bidirectional=True)
    model.load_state_dict(load_file(str(model_path), device=str(device)))
    model = model.to(device)
    model.eval()

    # Predict class
    with torch.no_grad():
        output = model(embeddings)
        _, predicted = torch.max(output, dim=1)

    # Language labels
    languages = [
        'Ada', 'Assembly', 'C', 'C#', 'C++', 'COBOL', 'Common Lisp', 'Dart', 'Erlang', 'F#',
        'Fortran', 'Go', 'Haskell', 'Java', 'JavaScript', 'Julia', 'Kotlin', 'Lua', 'MATLAB',
        'Objective-C', 'PHP', 'Perl', 'Prolog', 'Python', 'R', 'Ruby', 'Rust', 'SQL', 'Scala',
        'Swift', 'TypeScript'
    ]
    
    return languages[predicted.item()]

# Example usage
if __name__ == "__main__":
    example_text = "print('Hello, world!')"  # Replace with actual text for inference
    model_file_path = Path("./model.safetensors")
    predicted_language = infer(example_text, model_file_path, "Snowflake/snowflake-arctic-embed-xs")
    print(f"Predicted programming language: {predicted_language}")

```
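
Running this example requires the `torch`, `transformers`, and `safetensors` packages, and it expects the classifier weights to be available locally at `./model.safetensors`.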