license: mit
datasets:
- flytech/python-codes-25k
tags:
- code
language:
- en
library_name: transformers
GPT2 PyCode
This model is a fine-tuned version of the 124M-parameter GPT-2 model, adapted for Python code generation and intended for testing purposes. It was fine-tuned on a small corpus of 25,000 Python code samples.
Model Description
This project features a GPT-2 (Generative Pre-trained Transformer 2) language model with 124 million parameters, the smallest GPT-2 variant, fine-tuned for Python code generation. Unlike larger models such as GPT-2 XL or GPT-3, it is a small-scale model designed primarily for testing and experimental purposes.
- Developed by: Maharnab Saikia
- Model type: Language model
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: GPT2 124M
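As a rough check on the 124M figure, the sketch below counts the parameters of a standard GPT-2 small layout. The dimensions (50,257-token vocabulary, 1,024 positions, 768-dimensional embeddings, 12 layers) are the usual GPT-2 small defaults and are assumed here, not read from this checkpoint. If special tokens were added during fine-tuning, the actual count would be slightly higher. The function name is hypothetical.

```python
def gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12):
    """Count parameters of a GPT-2 style decoder (output head tied to the token embedding)."""
    # Token and position embedding tables
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # One layer norm: weight + bias
    layer_norm = 2 * n_embd
    # Attention: fused QKV projection plus output projection, each with a bias
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4x width, then project back, each with a bias
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    per_layer = 2 * layer_norm + attention + mlp
    # Final layer norm sits after the last block
    return embeddings + n_layer * per_layer + layer_norm

print(f"{gpt2_param_count():,}")  # 124,439,808
```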
Uses
- Research: Studying the behavior of small-scale language models in code generation tasks
- Benchmarking: Providing a baseline for comparing different model architectures or training strategies
- Rapid Prototyping: Quick tests of code generation ideas without the overhead of larger models
- Education: Demonstrating the principles of fine-tuning language models for specific tasks
Bias, Risks, and Limitations
It's crucial to understand the limitations of this model:
- Limited knowledge base due to the small training corpus
- May struggle with complex or specialized Python code
- Not suitable for production-level code generation tasks
- Performance will likely be significantly lower than that of larger, more comprehensively trained models
How to Get Started with the Model
Use the code below to get started with the model.
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import re
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = GPT2Tokenizer.from_pretrained('maharnab/gpt2_pycode')
model = GPT2LMHeadModel.from_pretrained('maharnab/gpt2_pycode')
model.to(device)
prompt = "How to reverse a string in Python."
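# Note: max_length=20 truncates longer prompts from the right, which can cut off the closing <assistant> tag; raise it for longer inputs
encoded_input = tokenizer.encode_plus(f"<sos><user>{prompt}</user><assistant>", max_length=20, truncation=True, return_tensors="pt").to(device)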
input_ids = encoded_input['input_ids']
attention_mask = encoded_input['attention_mask']
output = model.generate(
    input_ids,
    max_length=512,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    temperature=0.7,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    attention_mask=attention_mask,
    pad_token_id=tokenizer.pad_token_id
)
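generated_code = tokenizer.decode(output[0])
# Keep only the text between the assistant tags; fall back to the full output if no closing tag was generated
match = re.search(r'<assistant>(.*?)</assistant>', generated_code, re.DOTALL)
generated_code = match.group(1) if match else generated_code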
print(f"Prompt: {prompt}\nGenerated Code:\n{generated_code}")
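The prompt template and the tag-based extraction used above can be tried without loading the model. The following is a minimal sketch; the helper names `build_prompt` and `extract_response` are hypothetical, and the sample completion is hand-written rather than model output. The extraction tolerates a missing closing tag, which can happen when generation stops at the length limit.

```python
import re

def build_prompt(user_text: str) -> str:
    """Wrap a request in the tag template used by the example above."""
    return f"<sos><user>{user_text}</user><assistant>"

def extract_response(decoded: str) -> str:
    """Return the text after <assistant>, stopping at </assistant> if it is present."""
    match = re.search(r"<assistant>(.*?)(?:</assistant>|$)", decoded, re.DOTALL)
    # If the assistant tag itself is missing, return the decoded text unchanged (minus whitespace)
    return match.group(1).strip() if match else decoded.strip()

# Hand-written stand-in for a decoded model output
sample = build_prompt("How to reverse a string in Python.") + "print(s[::-1])</assistant>"
print(extract_response(sample))  # print(s[::-1])
```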
Training Details
Training Data
- Model: GPT-2 with 124 million parameters
- Training Data: 25,000 Python code samples from the flytech/python-codes-25k dataset
- Fine-tuning: Adapted specifically for Python code generation tasks
Training Hyperparameters
- Epochs: 5
- Batch Size: 8
- Learning Rate: 5e-5
- Context Window: 512 tokens
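To give a sense of scale, the hyperparameters above imply the following step counts. This is a back-of-the-envelope sketch that assumes all 25,000 samples were used for training and that no gradient accumulation was applied; neither detail is stated in this card.

```python
import math

num_samples = 25_000    # size of the training corpus
batch_size = 8
epochs = 5
context_window = 512

# Assumption: every sample is used for training, one optimizer step per batch
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Upper bound on tokens processed, reached only if every sample fills the context window
max_tokens_seen = num_samples * context_window * epochs

print(steps_per_epoch, total_steps, max_tokens_seen)  # 3125 15625 64000000
```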
Environmental Impact
Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: P100 GPU
- Hours used: 5
- Cloud Provider: Kaggle
- Compute Region: South Asia
- Carbon Emitted: 1.15 kg CO2eq
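Estimates of this kind generally multiply energy use by the carbon intensity of the grid. The sketch below illustrates that arithmetic. Both the 250 W board power for the P100 and the 0.92 kg CO2eq/kWh grid intensity are assumptions chosen for illustration and are not stated in this card; with those inputs the result happens to match the reported figure if it is read in kg CO2eq. The function name is hypothetical.

```python
def estimate_emissions_kg(power_watts: float, hours: float, kg_co2_per_kwh: float) -> float:
    """Emissions = energy used (kWh) x carbon intensity of the grid (kg CO2eq per kWh)."""
    energy_kwh = power_watts / 1000 * hours
    return energy_kwh * kg_co2_per_kwh

# Assumed inputs (not from the card): 250 W GPU power draw, 0.92 kg CO2eq/kWh
estimate = estimate_emissions_kg(power_watts=250, hours=5, kg_co2_per_kwh=0.92)
print(round(estimate, 2))  # 1.15
```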
Acknowledgements
This project builds upon the GPT-2 model developed by OpenAI. We acknowledge their groundbreaking work in the field of natural language processing.