---
license: apache-2.0
datasets:
- lambdasec/cve-single-line-fixes
- lambdasec/gh-top-1000-projects-vulns
language:
- code
tags:
- code
programming_language:
- Java
- JavaScript
- Python
inference: false
model-index:
- name: SantaFixer
  results:
  - task:
      type: text-generation
    dataset:
      type: openai/human-eval-infilling
      name: HumanEval
    metrics:
    - name: single-line infilling pass@1
      type: pass@1
      value: 0.47
      verified: false
    - name: single-line infilling pass@10
      type: pass@10
      value: 0.74
      verified: false
  - task:
      type: text-generation
    dataset:
      type: lambdasec/gh-top-1000-projects-vulns
      name: GH Top 1000 Projects Vulnerabilities
    metrics:
    - name: pass@1 (Java)
      type: pass@1
      value: 0.26
      verified: false
    - name: pass@10 (Java)
      type: pass@10
      value: 0.48
      verified: false
    - name: pass@1 (Python)
      type: pass@1
      value: 0.31
      verified: false
    - name: pass@10 (Python)
      type: pass@10
      value: 0.56
      verified: false
    - name: pass@1 (JavaScript)
      type: pass@1
      value: 0.36
      verified: false
    - name: pass@10 (JavaScript)
      type: pass@10
      value: 0.62
      verified: false
---

# Model Card for SantaFixer

This is an LLM for code that is focused on generating bug fixes using infilling.

## Model Details

### Model Description

- **Developed by:** [codelion](https://huggingface.co/codelion)
- **Model type:** GPT-2
- **Finetuned from model:** [bigcode/santacoder](https://huggingface.co/bigcode/santacoder)

## How to Get Started with the Model

Use the code below to get started with the model. The prompt uses SantaCoder's fill-in-the-middle format: `<fim-prefix>` and `<fim-suffix>` delimit the code before and after the gap, and the model generates the missing middle after `<fim-middle>`.

```python
# pip install -q transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "lambdasec/santafixer"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint,
                                             trust_remote_code=True).to(device)

input_text = "<fim-prefix>def print_hello_world():\n    <fim-suffix>\n    print('Hello world!')<fim-middle>"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```

## Training Details

- **GPU:** Tesla P100
- **Time:** ~5 hours

### Training Data

The model was fine-tuned on the [CVE single line fixes dataset](https://huggingface.co/datasets/lambdasec/cve-single-line-fixes).

### Training Procedure

Supervised Fine-Tuning (SFT)

#### Training Hyperparameters

- **optim:** adafactor
- **gradient_accumulation_steps:** 4
- **gradient_checkpointing:** true
- **fp16:** false

A sketch showing how these settings map onto a training configuration follows the Evaluation section below.

## Evaluation

The model was evaluated on the [GitHub top 1000 projects vulnerabilities dataset](https://huggingface.co/datasets/lambdasec/gh-top-1000-projects-vulns).
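
The pass@1 and pass@10 scores reported above follow the usual pass@k methodology: generate *n* candidate fixes per problem, count the *c* candidates that pass, and estimate pass@k with the unbiased estimator 1 - C(n-c, k)/C(n, k). The sketch below assumes that standard HumanEval-style definition; the function name and sample counts are illustrative, not taken from the actual evaluation code.

```python
def estimate_pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k), computed in product
    form to avoid large binomial coefficients.
    n = candidates generated per problem, c = candidates that pass."""
    if n - c < k:
        # Fewer failing candidates than the sample size: every size-k
        # sample must contain at least one passing candidate.
        return 1.0
    result = 1.0
    for i in range(n - c + 1, n + 1):
        result *= 1.0 - k / i
    return 1.0 - result

# Illustrative numbers: 10 candidate fixes generated, 4 pass the check.
print(estimate_pass_at_k(n=10, c=4, k=1))   # 0.4
print(estimate_pass_at_k(n=10, c=4, k=10))  # 1.0
```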
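
For reference, here is a minimal sketch of how the settings listed under Training Hyperparameters could be expressed as a Hugging Face `TrainingArguments` configuration. This is not the actual training script: `output_dir`, the batch size, and the epoch count are assumptions, and dataset loading and tokenization are omitted.

```python
from transformers import TrainingArguments

# Sketch only: the four settings from the model card are set explicitly;
# everything else is an illustrative assumption.
training_args = TrainingArguments(
    output_dir="santafixer-sft",     # hypothetical
    optim="adafactor",               # from the model card
    gradient_accumulation_steps=4,   # from the model card
    gradient_checkpointing=True,     # from the model card
    fp16=False,                      # from the model card
    per_device_train_batch_size=1,   # assumption: fits a single Tesla P100
    num_train_epochs=1,              # assumption
)
```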