---
license: mit
datasets:
- flytech/python-codes-25k
tags:
- code
---

# GPT2 PyCode

This model is a fine-tuned version of GPT-2 (124M parameters), adapted for testing and experimentation with Python code generation. It was trained on a small corpus of 25,000 Python code samples.

### Model Description

This project features a GPT (Generative Pre-trained Transformer) language model with 124 million parameters that has been fine-tuned for Python code generation. Unlike larger models such as GPT-2 Large or GPT-3, it is a small-scale model intended primarily for testing and experimental purposes.

- **Developed by:** Maharnab Saikia
- **Model type:** Language model
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** GPT-2 124M

## Uses

- **Research:** Studying the behavior of small-scale language models in code generation tasks
- **Benchmarking:** Providing a baseline for comparing different model architectures or training strategies
- **Rapid Prototyping:** Quick tests of code generation ideas without the overhead of larger models
- **Education:** Demonstrating the principles of fine-tuning language models for specific tasks

## Bias, Risks, and Limitations

It is important to understand the limitations of this model:

- Limited knowledge base due to the small training corpus
- May struggle with complex or specialized Python code
- Not suitable for production-level code generation tasks
- Performance will likely be significantly lower than larger, more comprehensively trained models

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = "gpt2"  # replace with this model's Hugging Face repository ID
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# GPT-2 has no padding token by default; reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token

prompt = "Replace me by any text you'd like."
encoded_input = tokenizer(prompt, max_length=20, truncation=True, return_tensors="pt")
input_ids = encoded_input["input_ids"]
attention_mask = encoded_input["attention_mask"]

output = model.generate(
    input_ids,
    max_length=512,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    temperature=0.7,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    attention_mask=attention_mask,
    pad_token_id=tokenizer.pad_token_id,
)

# Decode the generated sequence, dropping any special tokens.
generated_code = tokenizer.decode(output[0], skip_special_tokens=True)

print(f"Prompt: {prompt}\nGenerated Code:\n{generated_code}")
```

## Training Details

### Training Data

- **Model:** GPT with 124 million parameters
- **Training Data:** 25,000 Python code samples ([flytech/python-codes-25k](https://huggingface.co/datasets/flytech/python-codes-25k))
- **Fine-tuning:** Adapted specifically for Python code generation tasks

#### Training Hyperparameters

- **Epochs:** 5
- **Batch Size:** 8
- **Learning Rate:** 5e-5
- **Context Window:** 512 tokens

A sketch of how a comparable fine-tuning run can be set up is given in the appendix at the end of this card.

## Environmental Impact

Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** P100 GPU
- **Hours used:** 5
- **Cloud Provider:** Kaggle
- **Compute Region:** South Asia
- **Carbon Emitted:** 1.15 kg CO₂eq

## Acknowledgements

This project builds upon the GPT-2 model developed by OpenAI. We acknowledge their groundbreaking work in the field of natural language processing.
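
## Appendix: Fine-Tuning Sketch

The original training script is not included in this card, so the snippet below is only a minimal sketch of how the listed hyperparameters (5 epochs, batch size 8, learning rate 5e-5, 512-token context window) could be applied to the `flytech/python-codes-25k` dataset with the Hugging Face `Trainer`. The `"text"` column name, the padding strategy, and the use of `Trainer` itself are assumptions for illustration, not a description of the original run.

```python
from datasets import load_dataset
from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load the 25k Python snippet corpus listed in this card's metadata.
dataset = load_dataset("flytech/python-codes-25k", split="train")

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

def tokenize(batch):
    # "text" is assumed to be the column holding the code samples;
    # 512 matches the context window listed above.
    return tokenizer(batch["text"], truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# mlm=False gives a causal language-modeling objective (next-token prediction).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-pycode",
    num_train_epochs=5,             # Epochs: 5
    per_device_train_batch_size=8,  # Batch Size: 8
    learning_rate=5e-5,             # Learning Rate: 5e-5
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

With `mlm=False`, the collator copies the input IDs into the labels so the model is trained on plain next-token prediction over the code samples, which matches the generation-style usage shown earlier in this card.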