---
license: mit
datasets:
  - flytech/python-codes-25k
tags:
  - code
language:
  - en
library_name: transformers
---

GPT2 PyCode

This model is a fine-tuned version of the GPT-2 124M model, adapted specifically for experimenting with Python code generation. It was fine-tuned on a small corpus of 25,000 Python code samples.

Model Description

This project features a GPT (Generative Pre-trained Transformer) language model with 124 million parameters that has been fine-tuned for Python code generation. Unlike larger models such as GPT-2 Large or GPT-3, this is a small-scale model intended primarily for testing and experimentation.

  • Developed by: Maharnab Saikia
  • Model type: Language model
  • Language(s) (NLP): English
  • License: MIT
  • Fine-tuned from model: GPT-2 124M
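The 124M figure matches the standard GPT-2 small configuration (12 layers, 768-dimensional embeddings, 1,024-token context, 50,257-token vocabulary). As a quick sanity check, the arithmetic below reproduces it under those assumed hyperparameters (not values read from this checkpoint):

```python
# Parameter count for the standard GPT-2 small configuration (assumed).
vocab, ctx, d, layers = 50257, 1024, 768, 12

embeddings = vocab * d + ctx * d                 # token + position embeddings
attn = (d * 3 * d + 3 * d) + (d * d + d)         # fused QKV projection + output projection
mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)      # up-projection + down-projection
layer_norms = 2 * 2 * d                          # two LayerNorms per block (weight + bias)
block = attn + mlp + layer_norms
final_ln = 2 * d                                 # final LayerNorm

total = embeddings + layers * block + final_ln
print(total)  # 124439808, i.e. ~124M
```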

Uses

  • Research: Studying the behavior of small-scale language models in code generation tasks
  • Benchmarking: Providing a baseline for comparing different model architectures or training strategies
  • Rapid Prototyping: Quick tests of code generation ideas without the overhead of larger models
  • Education: Demonstrating the principles of fine-tuning language models for specific tasks

Bias, Risks, and Limitations

It's crucial to understand the limitations of this model:

  • Limited knowledge base due to the small training corpus
  • May struggle with complex or specialized Python code
  • Not suitable for production-level code generation tasks
  • Performance will likely be significantly lower than larger, more comprehensively trained models

How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import re

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = GPT2Tokenizer.from_pretrained('maharnab/gpt2_pycode')
model = GPT2LMHeadModel.from_pretrained('maharnab/gpt2_pycode')
model.to(device)
model.eval()

prompt = "How to reverse a string in Python."

# Wrap the prompt in the chat-style tags the model was trained on.
encoded_input = tokenizer.encode_plus(
    f"<sos><user>{prompt}</user><assistant>",
    max_length=20,
    truncation=True,
    return_tensors="pt",
).to(device)

input_ids = encoded_input['input_ids']
attention_mask = encoded_input['attention_mask']

output = model.generate(
    input_ids,
    max_length=512,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    temperature=0.7,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    attention_mask=attention_mask,
    # Fall back to EOS if the tokenizer defines no dedicated pad token.
    pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
)

generated_code = tokenizer.decode(output[0])
# Extract the assistant's reply; keep the raw output if the closing tag is missing.
match = re.search(r'<assistant>(.*?)</assistant>', generated_code, re.DOTALL)
generated_code = match.group(1) if match else generated_code

print(f"Prompt: {prompt}\nGenerated Code:\n{generated_code}")
```
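The prompt wrapping above implies the fine-tuning corpus was serialized with `<sos>`/`<user>`/`<assistant>` tags. A small helper for building prompts in that format (a sketch inferred from the snippet, not an official API of this model):

```python
def build_prompt(user_message: str) -> str:
    """Serialize a user message in the tag format the checkpoint expects."""
    return f"<sos><user>{user_message}</user><assistant>"

def build_training_example(user_message: str, assistant_code: str) -> str:
    """Full tagged example, as the fine-tuning corpus was presumably formatted."""
    return f"{build_prompt(user_message)}{assistant_code}</assistant>"

print(build_prompt("How to reverse a string in Python."))
# <sos><user>How to reverse a string in Python.</user><assistant>
```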

Training Details

Training Data

  • Model: GPT with 124 million parameters
  • Training Data: 25,000 Python code samples from the flytech/python-codes-25k dataset
  • Fine-tuning: Adapted specifically for Python code generation tasks

Training Hyperparameters

  • Epochs: 5
  • Batch Size: 8
  • Learning Rate: 5e-5
  • Context Window: 512
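With 25,000 samples, a batch size of 8, and 5 epochs, the implied optimizer schedule works out as follows (simple arithmetic, assuming one sample per sequence and no gradient accumulation):

```python
samples, batch_size, epochs = 25_000, 8, 5

steps_per_epoch = -(-samples // batch_size)  # ceiling division: 3125 steps per epoch
total_steps = steps_per_epoch * epochs       # 15,625 optimizer steps overall
print(steps_per_epoch, total_steps)          # 3125 15625
```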

Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: P100 GPU
  • Hours used: 5
  • Cloud Provider: Kaggle
  • Compute Region: South Asia
  • Carbon Emitted: 1.15 kg CO₂eq
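The reported figure is consistent with the calculator's formula, emissions = GPU power × hours × grid carbon intensity. The power draw and intensity below are illustrative assumptions, not values stated on this card:

```python
gpu_power_kw = 0.250       # P100 TDP of 250 W (assumed)
hours = 5                  # from the card
carbon_intensity = 0.92    # kg CO2eq per kWh (assumed regional grid average)

emissions = gpu_power_kw * hours * carbon_intensity
print(round(emissions, 2))  # 1.15 (kg CO2eq)
```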

Acknowledgements

This project builds upon the GPT-2 model developed by OpenAI. We acknowledge their groundbreaking work in the field of natural language processing.