Commit 10ea8c2 by emre · verified · 1 parent: 5d48143

Create README.md

Files changed (1)
  1. README.md +58 -0
README.md ADDED
@@ -0,0 +1,58 @@
---
license: afl-3.0
tags:
- code
- java
- doc2vec
- gensim
---

# Doc2Vec Java Methods Model

## Model Description

This Doc2Vec model is trained on a large corpus of Java methods to learn vector representations of Java code snippets. It is designed to capture the semantic meaning of code fragments, enabling tasks such as code similarity search, code clustering, and code recommendation. It is intended for developers, data scientists, and researchers working on source code analysis, and can support code maintenance, refactoring, and comprehension.

## How It Works

Doc2Vec is an unsupervised algorithm that learns vector representations of documents. Unlike word-embedding models such as Word2Vec, which learn one vector per word, Doc2Vec learns a single fixed-length vector for a whole document or, in this case, a code snippet. Snippets that appear in similar contexts end up close together in the vector space, which makes similarity comparisons and clustering straightforward.
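
As a rough illustration of what a similarity comparison between two document vectors means, the sketch below computes cosine similarity in plain Python. The `cosine_similarity` helper and the three-dimensional vectors are made up for illustration; vectors from this model have 200 dimensions, and gensim's `most_similar` ranks results by the same measure.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional vectors standing in for inferred method vectors.
getter_a = [0.9, 0.1, 0.0]
getter_b = [0.8, 0.2, 0.1]
parser = [0.0, 0.2, 0.9]

print(cosine_similarity(getter_a, getter_b))  # close to 1: similar direction
print(cosine_similarity(getter_a, parser))    # close to 0: nearly unrelated
```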

## Training Process

The model was trained using the `gensim` library's Doc2Vec implementation, with the following key hyperparameters:

- Vector size: 200
- Window size: 10
- Minimum count: 5
- Workers: 4 (for parallel processing)
- Epochs: 6
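
The hyperparameters above map directly onto keyword arguments of gensim's `Doc2Vec` constructor. The following is a minimal sketch of such a training call, not the actual training script: the `train_doc2vec` helper, the `HYPERPARAMS` name, the integer document tags, and the choice to leave every other option at gensim's default are assumptions.

```python
# Hyperparameters as listed in this README.
HYPERPARAMS = {
    "vector_size": 200,
    "window": 10,
    "min_count": 5,
    "workers": 4,
    "epochs": 6,
}


def train_doc2vec(token_lists):
    """Train a Doc2Vec model on an iterable of token lists (hypothetical helper)."""
    # Imported here so the configuration above can be inspected without gensim.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Assumption: each method is tagged with its position in the corpus.
    corpus = [
        TaggedDocument(words=tokens, tags=[index])
        for index, tokens in enumerate(token_lists)
    ]
    model = Doc2Vec(**HYPERPARAMS)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model
```

Because `min_count=5` discards tokens seen fewer than five times, a toy corpus needs repeated tokens before this helper can build a non-empty vocabulary.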

### Data Preprocessing

The training dataset, `anjandash/java-8m-methods-v2`, consists of 8 million Java methods. We trained on the combined training and validation splits plus half of the test split; the other half of the test split was held out for model evaluation. Each method was tokenized by splitting on whitespace.
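
The split handling and tokenization described above can be sketched in plain Python. This is an illustration under assumptions: the helper names are hypothetical, the splits are represented as in-memory lists of method strings, and since the text does not say which half of the test split was used for training, the first half is chosen arbitrarily here.

```python
def tokenize(method_source):
    """Split a Java method into tokens on runs of whitespace."""
    return method_source.split()


def build_corpora(train, validation, test):
    """Return (training methods, held-out evaluation methods)."""
    midpoint = len(test) // 2
    # Assumption: the first half of the test split joins the training data.
    training = train + validation + test[:midpoint]
    evaluation = test[midpoint:]
    return training, evaluation


# Toy stand-ins for the dataset splits.
train = ["public int getId() { return id; }"]
validation = ["void reset() { count = 0; }"]
test = ["int size() { return n; }", "boolean isEmpty() { return n == 0; }"]

training, evaluation = build_corpora(train, validation, test)
print(len(training), len(evaluation))  # 3 1
print(tokenize(training[0]))
```

Whitespace splitting keeps punctuation attached to identifiers (for example `getId()` and `id;`), so snippets passed to the model at inference time should presumably be tokenized the same way.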

## Limitations and Biases

### Limitations

- The model's performance depends heavily on the diversity and quality of the training data. Although it was trained on a large dataset of Java methods, it may be less effective on code from significantly different contexts or from other programming languages.
- Vector representations are sensitive to the choice of hyperparameters. The current settings follow general best practices and may leave room for tuning in specific use cases.

### Potential Biases

- The training dataset is derived from publicly available Java methods, which may not represent all coding styles or practices equally. The model may therefore favor common or popular coding conventions over others.
45
+ ## How to Use
46
+
47
+ To use this model, you'll need the `gensim` library. Here's a quick example:
48
+
49
+ ```python
50
+ from gensim.models.doc2vec import Doc2Vec
51
+
52
+ model = Doc2Vec.load("path_to_model/java_8m_methods_doc2vec.model")
53
+
54
+ # Infer vector for a new document (code snippet)
55
+ vector = model.infer_vector(["public", "static", "void", "main", "String[]", "args"])
56
+
57
+ # Find similar documents
58
+ similar_docs = model.dv.most_similar([vector], topn=5)