---
license: apache-2.0
language:
- en
tags:
- code
- knowledge extraction
- tiny
- small
---

A model that can **extract the knowledge points** from given **C language code**.

The base model is [pythia-70m](https://huggingface.co/EleutherAI/pythia-70m), fine-tuned for 10 epochs on my own training set with the [QLoRA](https://github.com/artidoro/qlora) method.

A usage example follows. First, import the model and prepare the code:

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

model_name_or_path = 'Mxode/Pythia-70m-C-Language-KnowledgeExtract'
device = 'cuda'

model = GPTNeoXForCausalLM.from_pretrained(model_name_or_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# instruction template
instruction = '[Summarize the knowledge points in the code below]\n'

# any C snippet you like; partial functions or statements work too
input_content = '''```c
int partition(int arr[], int low, int high) {
    int pivot = arr[high];
    int i = (low - 1);
    for (int j = low; j <= high - 1; j++) {
        if (arr[j] < pivot) {
            i++;
            swap(&arr[i], &arr[j]);
        }
    }
    swap(&arr[i + 1], &arr[high]);
    return (i + 1);
}

void quickSort(int arr[], int low, int high) {
    if (low < high) {
        int pi = partition(arr, low, high);
        quickSort(arr, low, pi - 1);
        quickSort(arr, pi + 1, high);
    }
}
```'''

text = instruction + input_content
```

Then generate:

```python
inputs = tokenizer(text, return_tensors="pt").to(device)

tokens = model.generate(
    **inputs,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=32,
)

# strip the echoed prompt and the end-of-text token, keeping only the answer
response = tokenizer.decode(tokens[0]).split('```')[-1].split('<')[0]
```

However, in practical use it is recommended to run inference several times to get more diverse responses. Don't worry: the model is really small, so the extra inferences don't take much time:

```python
ans_dict = {}
def increment_insert(key):
    ans_dict[key] = ans_dict.get(key, 0) + 1

for i in range(30):  # 20 iterations or fewer may be enough too
    inputs = tokenizer(text, return_tensors="pt").to(device)
    tokens = model.generate(
        **inputs,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=32,
        do_sample=True,
        temperature=2.0,  # high temperature for diversity
        top_p=0.95,
        top_k=30,
    )
    response = tokenizer.decode(tokens[0]).split('```')[-1].split('<')[0]
    increment_insert(response)

print(ans_dict)
### example output; take the high-frequency answers as the result
### {'Backtracking': 1, 'Heap': 1, 'Quick sort': 25, 'Recurrence': 2, 'Queue': 1}
```
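Since `ans_dict` maps each sampled answer to its count, the most frequent key can be taken as the final knowledge point. A minimal sketch, assuming the counts from the example output above:

```python
# pick the answer that was sampled most often
final_answer = max(ans_dict, key=ans_dict.get)
print(final_answer)  # with the counts above: 'Quick sort'
```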