---
license: mit
datasets:
- flytech/python-codes-25k
tags:
- code
language:
- en
library_name: transformers
---
# GPT2 PyCode
<!-- Provide a quick summary of what the model is/does. -->
This model is a fine-tuned version of GPT-2 (124M parameters), adapted for Python code generation and intended primarily for testing purposes. It was trained on a small corpus of 25,000 Python code samples.
### Model Description
<!-- Provide a longer summary of what this model is. -->
This project features a GPT (Generative Pre-trained Transformer) language model with 124 million parameters that has been fine-tuned for Python code generation. Unlike larger models such as GPT-2 Large or GPT-3, this is a small-scale model designed primarily for testing and experimental purposes.
- **Developed by:** Maharnab Saikia
- **Model type:** Language model
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** GPT2 124M
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
- **Research:** Studying the behavior of small-scale language models in code generation tasks
- **Benchmarking:** Providing a baseline for comparing different model architectures or training strategies
- **Rapid Prototyping:** Quick tests of code generation ideas without the overhead of larger models
- **Education:** Demonstrating the principles of fine-tuning language models for specific tasks
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
It's crucial to understand the limitations of this model:
- Limited knowledge base due to the small training corpus
- May struggle with complex or specialized Python code
- Not suitable for production-level code generation tasks
- Performance will likely be significantly lower than larger, more comprehensively trained models
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import re

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = GPT2Tokenizer.from_pretrained('maharnab/gpt2_pycode')
model = GPT2LMHeadModel.from_pretrained('maharnab/gpt2_pycode')
model.to(device)

prompt = "How to reverse a string in Python."

# Wrap the prompt in the template used during fine-tuning.
encoded_input = tokenizer.encode_plus(
    f"<sos><user>{prompt}</user><assistant>",
    max_length=20,
    truncation=True,
    return_tensors="pt",
).to(device)
input_ids = encoded_input['input_ids']
attention_mask = encoded_input['attention_mask']

# Sample a completion with nucleus sampling.
output = model.generate(
    input_ids,
    max_length=512,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    temperature=0.7,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    attention_mask=attention_mask,
    pad_token_id=tokenizer.pad_token_id,
)

# Keep only the text between the <assistant> ... </assistant> markers.
generated_code = tokenizer.decode(output[0])
match = re.search(r'<assistant>(.*?)</assistant>', generated_code, re.DOTALL)
generated_code = match.group(1) if match else generated_code

print(f"Prompt: {prompt}\nGenerated Code:\n{generated_code}")
```
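The model expects prompts wrapped in the `<sos><user>...</user><assistant>` template used during fine-tuning, and it is intended to close its answer with an `</assistant>` tag; the regular expression above simply extracts the code between the two assistant markers.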
## Training Details
### Training Data
- **Model:** GPT with 124 million parameters
- **Training Data:** 25,000 Python code samples ([flytech/python-codes-25k](https://huggingface.co/datasets/flytech/python-codes-25k))
- **Fine-tuning:** Adapted specifically for Python code generation tasks
#### Training Hyperparameters
- **Epochs:** 5
- **Batch Size:** 8
- **Learning Rate:** 5e-5
- **Context Window:** 512 tokens
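
For reference, the snippet below is a minimal sketch of how a fine-tune with these hyperparameters could be reproduced with the Hugging Face `Trainer`. It is not the author's original training script: the dataset column names (`instruction`, `output`) and the special-token setup are assumptions and may need adjusting.

```python
# Minimal fine-tuning sketch (not the original training script).
from datasets import load_dataset
from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Register the template markers and a pad token (assumed setup).
tokenizer.add_special_tokens({
    "additional_special_tokens": ["<sos>", "<user>", "</user>", "<assistant>", "</assistant>"],
    "pad_token": "<pad>",
})

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

dataset = load_dataset("flytech/python-codes-25k", split="train")

def format_and_tokenize(example):
    # Column names ("instruction", "output") are assumptions about the dataset schema.
    text = f"<sos><user>{example['instruction']}</user><assistant>{example['output']}</assistant>"
    return tokenizer(text, truncation=True, max_length=512)  # 512-token context window

tokenized = dataset.map(format_and_tokenize, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="gpt2_pycode",
    num_train_epochs=5,             # Epochs: 5
    per_device_train_batch_size=8,  # Batch size: 8
    learning_rate=5e-5,             # Learning rate: 5e-5
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```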
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700); a rough back-of-the-envelope check of the figure appears after the list below.
- **Hardware Type:** P100 GPU
- **Hours used:** 5
- **Cloud Provider:** Kaggle
- **Compute Region:** South Asia
- **Carbon Emitted:** 1.15 kg CO₂eq
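
As a rough sanity check (not part of the original estimate), multiplying an assumed P100 power draw by the reported runtime and an assumed grid carbon intensity for the region lands near the reported figure:

```python
# Rough sanity check with assumed values, not measurements:
# energy (kWh) = GPU power (kW) * runtime (h); emissions = energy * grid carbon intensity.
gpu_power_kw = 0.25       # assumed P100 board power, ~250 W
hours = 5                 # reported training time
grid_kg_per_kwh = 0.9     # assumed carbon intensity of the compute region
emissions_kg = gpu_power_kw * hours * grid_kg_per_kwh
print(f"~{emissions_kg:.2f} kg CO2eq")  # ~1.12 kg, close to the reported 1.15
```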
## Acknowledgements
This project builds upon the GPT-2 model developed by OpenAI. We acknowledge their groundbreaking work in the field of natural language processing.