---
license: apache-2.0
language:
- en
tags:
- code
- knowledge extraction
- tiny
- small
---
A model that can **extract the knowledge points** from a given piece of **C language code**.

The base model is [pythia-70m](https://huggingface.co/EleutherAI/pythia-70m). It was fine-tuned for 10 epochs with the [QLoRA](https://github.com/artidoro/qlora) method on my own training set.
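
For readers unfamiliar with QLoRA, the sketch below shows roughly what such a setup could look like, assuming the `peft` and `bitsandbytes` libraries; the hyperparameters are illustrative and not necessarily the ones used to train this model:

```python
import torch
from transformers import GPTNeoXForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = 'EleutherAI/pythia-70m'

# load the base model with 4-bit quantized weights (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = GPTNeoXForCausalLM.from_pretrained(base_model, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# attach LoRA adapters to the attention projections; only these are trained
lora_config = LoraConfig(
    r=16,                               # illustrative rank / alpha / dropout values
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=['query_key_value'],
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, lora_config)
# ...then train as usual, e.g. with transformers.Trainer on the instruction data
```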

A usage example follows. First, load the model and prepare the code:

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

model_name_or_path = 'Mxode/Pythia-70m-C-Language-KnowledgeExtract'
device = 'cuda'

model = GPTNeoXForCausalLM.from_pretrained(model_name_or_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# instruction template
instruction = '[Summarize the knowledge points in the code below]\n'
# any C snippet you like; partial functions or bare statements also work
input_content = '''```c
int partition(int arr[], int low, int high) {
    int pivot = arr[high];
    int i = (low - 1);
    for (int j = low; j <= high - 1; j++) {
        if (arr[j] < pivot) {
            i++;
            swap(&arr[i], &arr[j]);
        }
    }
    swap(&arr[i + 1], &arr[high]);
    return (i + 1);
}

void quickSort(int arr[], int low, int high) {
    if (low < high) {
        int pi = partition(arr, low, high);
        quickSort(arr, low, pi - 1);
        quickSort(arr, pi + 1, high);
    }
}
```'''
text = instruction + input_content
```

Then generate:

```python
inputs = tokenizer(text, return_tensors="pt").to(device)
tokens = model.generate(
    **inputs,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=32,
)
response = tokenizer.decode(tokens[0]).split('```')[-1].split('<')[0]  # strip the echoed prompt and the trailing <|endoftext|> token
```
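
The decoded sequence echoes the prompt (which ends with the closing code fence of the C snippet), followed by the model's answer and the trailing `<|endoftext|>` token, which is why the two `split` calls keep only the answer:

```python
raw = tokenizer.decode(tokens[0])
# raw looks like '<echoed prompt ending in a code fence><answer><|endoftext|>',
# so splitting on the fence drops the prompt and splitting on '<' drops
# the end-of-text token.
print(response.strip())  # e.g. 'Quick sort' for the snippet above
```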



In practical use, however, it is recommended to run inference several times to get more diverse answers and aggregate them. Don't worry, the model is tiny, so repeated inference takes very little time:

```python
ans_dict = {}
def increment_insert(key):
    ans_dict[key] = ans_dict.get(key, 0) + 1

for i in range(30):  # 20 runs or even fewer may be enough
    inputs = tokenizer(text, return_tensors="pt").to(device)
    tokens = model.generate(
        **inputs,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=32,
        do_sample=True,
        temperature=2.0,  # high temperature for diversity
        top_p=0.95,
        top_k=30,
    )
    response = tokenizer.decode(tokens[0]).split('```')[-1].split('<')[0]
    increment_insert(response)

print(ans_dict)
### example output below; take the high-frequency answer as the result
### {
###     'Backtracking': 1,
###     'Heap': 1,
###     'Quick sort': 25,
###     'Recurrence': 2,
###     'Queue': 1
### }
```
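
Since all answers are counted in `ans_dict`, the highest-frequency key can simply be taken as the final prediction:

```python
# take the most frequent answer as the final label
best_answer = max(ans_dict, key=ans_dict.get)
print(best_answer)  # 'Quick sort' for the example counts above
```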