---
license: apache-2.0
datasets:
- lambdasec/cve-single-line-fixes
- lambdasec/gh-top-1000-projects-vulns
language:
- code
tags:
- code
programming_language:
- Java
- JavaScript
- Python
inference: false
model-index:
- name: SantaFixer
  results:
  - task:
      type: text-generation
    dataset:
      type: openai/human-eval-infilling
      name: HumanEval
    metrics:
    - name: single-line infilling pass@1
      type: pass@1
      value: 0.47
      verified: false
    - name: single-line infilling pass@10
      type: pass@10
      value: 0.74
      verified: false
  - task:
      type: text-generation
    dataset:
      type: lambdasec/gh-top-1000-projects-vulns
      name: GH Top 1000 Projects Vulnerabilities
    metrics:
    - name: pass@1 (Java)
      type: pass@1
      value: 0.26
      verified: false
    - name: pass@10 (Java)
      type: pass@10
      value: 0.48
      verified: false
    - name: pass@1 (Python)
      type: pass@1
      value: 0.31
      verified: false
    - name: pass@10 (Python)
      type: pass@10
      value: 0.56
      verified: false
    - name: pass@1 (JavaScript)
      type: pass@1
      value: 0.36
      verified: false
    - name: pass@10 (JavaScript)
      type: pass@10
      value: 0.62
      verified: false
---

# Model Card for SantaFixer

<!-- Provide a quick summary of what the model is/does. -->

This is an LLM for code that is focused on generating bug fixes using infilling.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** [codelion](https://huggingface.co/codelion)
- **Model type:** GPT-2
- **Finetuned from model:** [bigcode/santacoder](https://huggingface.co/bigcode/santacoder)

## How to Get Started with the Model

Use the code below to get started with the model.

```python
# pip install -q transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "lambdasec/santafixer"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, trust_remote_code=True
).to(device)

# Fill-in-the-middle prompt: the model generates the code that belongs
# between the <fim-prefix> and <fim-suffix> segments.
input_text = "<fim-prefix>def print_hello_world():\n    <fim-suffix>\n    print('Hello world!')<fim-middle>"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
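
In a bug-fixing setting the same FIM format is used to regenerate a suspect line: everything before the line becomes the prefix, everything after it becomes the suffix. The helper below is only an illustrative sketch (`build_fix_prompt` is a hypothetical name, not part of this repository):

```python
# Hypothetical helper, for illustration only: wrap everything before and
# after the suspect line in FIM tags so the model regenerates that one line.
def build_fix_prompt(source: str, buggy_line: int) -> str:
    lines = source.splitlines(keepends=True)
    prefix = "".join(lines[:buggy_line])       # code before the suspect line (0-based index)
    suffix = "".join(lines[buggy_line + 1:])   # code after the suspect line
    return f"<fim-prefix>{prefix}<fim-suffix>{suffix}<fim-middle>"
```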

## Training Details

- **GPU:** Tesla P100
- **Time:** ~5 hrs

### Training Data

The model was fine-tuned on the [CVE single line fixes dataset](https://huggingface.co/datasets/lambdasec/cve-single-line-fixes).
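
To inspect the fine-tuning data directly, it can be loaded with the `datasets` library; this is just a convenience snippet and makes no assumptions about the dataset's columns:

```python
from datasets import load_dataset

# Download the fine-tuning dataset and print its splits and column names.
ds = load_dataset("lambdasec/cve-single-line-fixes")
print(ds)
```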

### Training Procedure

Supervised Fine-Tuning (SFT). A sketch of how the hyperparameters below map onto a standard `transformers` setup follows the list.

#### Training Hyperparameters

- **optim:** adafactor
- **gradient_accumulation_steps:** 4
- **gradient_checkpointing:** true
- **fp16:** false
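
A minimal sketch of how these settings could be expressed with the `transformers` `TrainingArguments` API; the output directory is a placeholder and everything not listed above is left at its default (this is an illustration, not the exact training script):

```python
from transformers import TrainingArguments

# Hyperparameters from the list above; output_dir is a placeholder path.
training_args = TrainingArguments(
    output_dir="santafixer-sft",
    optim="adafactor",
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    fp16=False,
)
```

These arguments would then be passed to a `Trainer` together with the base model and a tokenized copy of the training data.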

## Evaluation

The model was tested with the [GitHub top 1000 projects vulnerabilities dataset](https://huggingface.co/datasets/lambdasec/gh-top-1000-projects-vulns).
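
The pass@1 and pass@10 values reported in the metadata above follow the usual pass@k convention. For reference, the standard unbiased estimator (Chen et al., 2021) can be computed as below; this is shown only as an illustration, not the exact evaluation harness used here:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n generated samples, c of which pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 candidate fixes per vulnerability, 3 of them pass.
print(pass_at_k(n=10, c=3, k=1))   # 0.3
print(pass_at_k(n=10, c=3, k=10))  # 1.0
```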