---
license: mit
datasets:
- flytech/python-codes-25k
tags:
- code
language:
- en
library_name: transformers
---
# GPT2 PyCode

This model is a fine-tuned version of the GPT-2 124M model, adapted for Python code generation and intended primarily for testing purposes. It was trained on a small corpus of 25,000 Python code samples.

### Model Description

This project features a GPT-2 (Generative Pre-trained Transformer) language model with 124 million parameters that has been fine-tuned for Python code generation. Unlike larger models such as GPT-2 Large or GPT-3, this is a small-scale model designed primarily for testing and experimental purposes.

- **Developed by:** Maharnab Saikia
- **Model type:** Language model
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** GPT-2 124M

## Uses

 - **Research:** Studying the behavior of small-scale language models in code generation tasks
 - **Benchmarking:** Providing a baseline for comparing different model architectures or training strategies
 - **Rapid Prototyping:** Quick tests of code generation ideas without the overhead of larger models
 - **Education:** Demonstrating the principles of fine-tuning language models for specific tasks

## Bias, Risks, and Limitations

It's crucial to understand the limitations of this model:

 - Limited knowledge base due to the small training corpus
 - May struggle with complex or specialized Python code
 - Not suitable for production-level code generation tasks
 - Performance will likely be significantly lower than that of larger, more comprehensively trained models

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import re

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = GPT2Tokenizer.from_pretrained('maharnab/gpt2_pycode')
model = GPT2LMHeadModel.from_pretrained('maharnab/gpt2_pycode')
model.to(device)

# Prompts are wrapped in the <sos><user>...</user><assistant> template used by this checkpoint
prompt = "How to reverse a string in Python."
encoded_input = tokenizer.encode_plus(
    f"<sos><user>{prompt}</user><assistant>",
    max_length=20,  # truncates long prompts to 20 tokens
    truncation=True,
    return_tensors="pt"
).to(device)

input_ids = encoded_input['input_ids']
attention_mask = encoded_input['attention_mask']

# Sample a completion of up to 512 tokens
output = model.generate(
    input_ids,
    max_length=512,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    temperature=0.7,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    attention_mask=attention_mask,
    pad_token_id=tokenizer.pad_token_id
)

# Extract the assistant's answer from between the <assistant> tags,
# falling back to the raw decoded text if the closing tag was not generated
generated_code = tokenizer.decode(output[0])
match = re.search(r'<assistant>(.*?)</assistant>', generated_code, re.DOTALL)
generated_code = match.group(1) if match else generated_code

print(f"Prompt: {prompt}\nGenerated Code:\n{generated_code}")
```
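
For quick experiments, the same checkpoint can also be driven through the higher-level `text-generation` pipeline. The snippet below is a minimal, non-authoritative sketch that mirrors the generation settings above; the `<sos><user>...</user><assistant>` prompt template still has to be applied by hand.

```python
from transformers import pipeline
import re

# Sketch only: high-level variant of the example above
generator = pipeline("text-generation", model="maharnab/gpt2_pycode")

prompt = "How to reverse a string in Python."
raw = generator(
    f"<sos><user>{prompt}</user><assistant>",
    max_length=512,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    no_repeat_ngram_size=2,
    num_return_sequences=1,
)[0]["generated_text"]

# Keep only the assistant's answer, as in the example above
match = re.search(r"<assistant>(.*?)</assistant>", raw, re.DOTALL)
print(match.group(1) if match else raw)
```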

## Training Details

### Training Data

 - **Model:** GPT-2 with 124 million parameters
 - **Training Data:** 25,000 Python code samples from [flytech/python-codes-25k](https://huggingface.co/datasets/flytech/python-codes-25k)
 - **Fine-tuning:** Adapted specifically for Python code generation tasks (see the data-preparation sketch below)
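
The exact preprocessing pipeline for this checkpoint is not documented here. As a rough, non-authoritative sketch, the dataset could be loaded and wrapped in the `<sos><user>...</user><assistant>...</assistant>` template that the inference example expects; the column names `instruction` and `output` are assumptions about the dataset schema and may need to be adjusted.

```python
from datasets import load_dataset

# Illustrative only: load the corpus referenced in this card's metadata
dataset = load_dataset("flytech/python-codes-25k", split="train")

def to_training_text(example):
    # Assumed column names; adapt to the dataset's actual fields
    return {
        "text": f"<sos><user>{example['instruction']}</user>"
                f"<assistant>{example['output']}</assistant>"
    }

dataset = dataset.map(to_training_text)
print(dataset[0]["text"][:200])
```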


#### Training Hyperparameters

 - **Epochs:** 5
 - **Batch Size:** 8
 - **Learning Rate:** 5e-5
 - **Context Window:** 512 tokens
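
The original training script is not included in this card. As a rough, non-authoritative illustration, a run with the hyperparameters above could be wired up with the `transformers` `Trainer` as sketched below; the `dataset` variable is assumed to be the formatted corpus from the Training Data sketch, and registering the custom `<sos>`/`<user>`/`<assistant>` tags as additional special tokens (with an embedding resize) is omitted for brevity.

```python
from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

def tokenize(batch):
    # 512-token context window, as listed above
    return tokenizer(batch["text"], truncation=True, max_length=512)

# `dataset` is assumed to be the formatted corpus built in the Training Data sketch
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="gpt2_pycode",
    num_train_epochs=5,              # Epochs: 5
    per_device_train_batch_size=8,   # Batch Size: 8
    learning_rate=5e-5,              # Learning Rate: 5e-5
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```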

## Environmental Impact

Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** P100 GPU
- **Hours used:** 5
- **Cloud Provider:** Kaggle
- **Compute Region:** South Asia
- **Carbon Emitted:** 1.15 kg of CO2eq

## Acknowledgements

This project builds upon the GPT-2 model developed by OpenAI. We acknowledge their groundbreaking work in the field of natural language processing.