---
license: cc-by-nc-sa-4.0
language:
- 'no'
---

# Model Card

NorGPT-3B-Instruction-peft is trained on top of the [NorGPT-3B](https://huggingface.co/NorGLM/NorGPT-3B) model on the [NO-Alpaca](https://huggingface.co/datasets/NbAiLab/norwegian-alpaca) dataset.

Prompt format:
```
{instruction} {input} : {output}
```

Inference prompt:
```
{instruction} {input} :
```
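
For example, the instruction and the (possibly empty) input are joined with a ` : ` separator to form the prompt. A minimal sketch, where the instruction and input values are made up for illustration:

```python
# Hypothetical NO-Alpaca-style fields, for illustration only.
instruction = "Oversett setningen til engelsk."
input_text = "Hva heter du?"

# The model is expected to continue the text after " : ".
prompt = f"{instruction} {input_text} : "
```
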
## Run the Model
```python
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

source_model_id = "NorGLM/NorGPT-3B"
peft_model_id = "NorGLM/NorGPT-3B-Instruction-peft"

# Load the adapter config and the base model it was trained on.
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(source_model_id, device_map='balanced')

# Cap tokenized inputs at 2048 tokens and reuse EOS as the padding token.
tokenizer_max_len = 2048
tokenizer_config = {'pretrained_model_name_or_path': source_model_id,
                    'max_len': tokenizer_max_len}
tokenizer = AutoTokenizer.from_pretrained(**tokenizer_config)
tokenizer.pad_token = tokenizer.eos_token

# Attach the instruction-tuned PEFT adapter to the base model.
model = PeftModel.from_pretrained(model, peft_model_id)
```
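
With the adapter attached, the model can be queried with a single prompt directly. A minimal usage sketch following the inference prompt format above (the prompt text is a made-up example, not taken from the dataset):

```python
prompt = "Oversett setningen til engelsk. Hva heter du? : "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```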

## Inference Example
Load the model to evaluate on the last 20% of the NO-Alpaca dataset:
```python
import json

import torch
from datasets import load_dataset
from transformers import set_seed

# Device for the tokenized inputs (the model itself is dispatched by device_map).
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

def merge_columns(example):
    # Build the inference prompt; the "input" column may be empty.
    if str(example["input"]) == "":
        example["text"] = str(example["instruction"]) + " : "
    else:
        example["text"] = str(example["instruction"]) + " " + str(example["input"]) + " : "
    return example

def generate_text(text, max_length=200, do_sample=True, top_p=0.92, top_k=0):
    set_seed(42)
    model_inputs = tokenizer(text, return_tensors='pt').to(torch_device)
    output = model.generate(**model_inputs, max_new_tokens=max_length, do_sample=do_sample,
                            top_p=top_p, top_k=top_k, no_repeat_ngram_size=2,
                            pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print("--LOADING EVAL DATA---")
eval_data = load_dataset("NbAiLab/norwegian-alpaca", split='train[-20%:]')
eval_data = eval_data.map(merge_columns)

print("--MAKING PREDICTIONS---")
model.eval()

output_file = "<output file name>"
with open(output_file, 'w', encoding='utf-8-sig') as file:
    generated_text = []

    for question in eval_data['text']:
        generated_text.append({"generated_text": generate_text(question)})
        print({"text_generated": len(generated_text)})

    # Write one JSON object per line (JSON Lines).
    json_lines = [json.dumps(data) for data in generated_text]
    json_data = "\n".join(json_lines)
    file.write(json_data)
```
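
Note that `tokenizer.decode` returns the prompt together with the model's continuation. A small post-processing sketch, assuming the JSON-lines file written above, that recovers only the generated answer by splitting on the ` : ` separator:

```python
import json

# Hypothetical post-processing of the file written above.
with open("<output file name>", encoding='utf-8-sig') as file:
    for line in file:
        record = json.loads(line)
        # Everything after the first " : " is the model's continuation.
        answer = record["generated_text"].split(" : ", 1)[-1]
        print(answer)
```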

## Note
More training details will be released soon!