Mxode committed
Commit e77286d
1 Parent(s): 9c32014

Update README.md

Files changed (1): README.md +88 -0
README.md CHANGED
@@ -1,3 +1,91 @@
---
license: apache-2.0
language:
- en
tags:
- code
- knowledge extraction
- tiny
- small
---
A model that can **extract the knowledge points** involved in a given piece of **C language code**.

The base model is [pythia-70m](https://huggingface.co/EleutherAI/pythia-70m). It was fine-tuned for 10 epochs with the [QLoRA](https://github.com/artidoro/qlora) method on my own training set.
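
For reference, here is a minimal sketch of what such a QLoRA setup might look like with `peft` and `bitsandbytes`; the hyperparameters and adapter settings below are illustrative assumptions, not the exact recipe used for this model:

```python
# Illustrative QLoRA sketch (assumed settings, not this model's actual training config).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model with 4-bit (NF4) quantized weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    'EleutherAI/pythia-70m',
    quantization_config=bnb_config,
    device_map='auto',
)
base = prepare_model_for_kbit_training(base)

# Attach small trainable low-rank adapters; r/alpha/dropout are assumed values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=['query_key_value'],  # the fused attention projection in GPT-NeoX
    task_type='CAUSAL_LM',
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```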
15
+ A usage example is as follows, first import the model and prepare the code:
16
+
17
+ ```python
18
+ from transformers import GPTNeoXForCausalLM, AutoTokenizer
19
+
20
+ model_name_or_path = 'Mxode/Pythia-70m-C-Language-KnowledgeExtract'
21
+ device = 'cuda'
22
+
23
+ model = GPTNeoXForCausalLM.from_pretrained(model_name_or_path).to(device)
24
+ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
25
+
26
+ instruction = '[Summarize the knowledge points in the code below]\n' # instruction template
27
+ # any c-lang pieces you like, could be partial functions or statements
28
+ input_content = '''```c
29
+ int partition(int arr[], int low, int high) {
30
+ int pivot = arr[high];
31
+ int i = (low - 1);
32
+ for (int j = low; j <= high - 1; j++) {
33
+ if (arr[j] < pivot) {
34
+ i++;
35
+ swap(&arr[i], &arr[j]);
36
+ }
37
+ }
38
+ swap(&arr[i + 1], &arr[high]);
39
+ return (i + 1);
40
+ }
41
+
42
+ void quickSort(int arr[], int low, int high) {
43
+ if (low < high) {
44
+ int pi = partition(arr, low, high);
45
+ quickSort(arr, low, pi - 1);
46
+ quickSort(arr, pi + 1, high);
47
+ }
48
+ }
49
+ ```'''
50
+ text = instruction + input_content
51
+ ```
52
+
53
+ Then generate:
54
+
55
+ ```python
56
+ inputs = tokenizer(text, return_tensors="pt").to(device)
57
+ tokens = model.generate(
58
+ **inputs,
59
+ pad_token_id=tokenizer.eos_token_id,
60
+ max_new_tokens=32,
61
+ )
62
+ response = tokenizer.decode(tokens[0]).split('```')[-1].split('<')[0] # deduplicate inputs
63
+ ```
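
The string slicing above relies on the prompt ending with a closing code fence. A less brittle alternative (my suggestion, not part of the original card) is to decode only the newly generated tokens:

```python
# Alternative: skip the prompt tokens instead of string-splitting the output.
new_tokens = tokens[0][inputs['input_ids'].shape[1]:]  # drop the prompt portion
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
```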

In practice, to get more diverse answers, it's recommended to run inference multiple times. Don't worry, the model is really small, so the extra inferences don't take much time:

```python
ans_dict = {}
def increment_insert(key):
    ans_dict[key] = ans_dict.get(key, 0) + 1

for i in range(30):  # 20 runs or even fewer may be enough
    inputs = tokenizer(text, return_tensors="pt").to(device)
    tokens = model.generate(
        **inputs,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=32,
        do_sample=True,
        temperature=2.0,  # high temperature for diversity
        top_p=0.95,
        top_k=30,
    )
    response = tokenizer.decode(tokens[0]).split('```')[-1].split('<')[0]
    increment_insert(response)

print(ans_dict)
### example output; take the high-frequency answers
### {'Backtracking': 1, 'Heap': 1, 'Quick sort': 25, 'Recurrence': 2, 'Queue': 1}
```
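
To turn the counts into a single label, a simple majority vote over the sampled answers works; a small convenience snippet (not part of the original card):

```python
best = max(ans_dict, key=ans_dict.get)  # the most frequent answer wins
print(best)  # 'Quick sort' for the example output above
```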