---
license: cc-by-nc-sa-4.0
language:
- 'no'
---

# Model Card

NorGPT-369M-summarization-peft is trained on top of the [NorGPT-369M](https://huggingface.co/NorGLM/NorGPT-369M) model on the [NO-CNN-DailyMail](https://huggingface.co/datasets/NorGLM/NO-CNN-DailyMail) dataset.

Prompt format:
```
Summarise the article:\\n{article} |||\\n{positive_sample}
```

Inference prompt:
```
Summarise the article:\\n{article} |||\\n
```
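
The `\\n` in these templates appears to be the literal two-character sequence `\n`, matching the string literals in the evaluation script below. As a minimal sketch, the prompts can be assembled with a small helper (`build_prompt` is an illustrative name, not part of the released code):

```python
# Illustrative helper for assembling prompts; not part of the released code.
# Assumption: "\\n" in the templates is the literal two-character sequence "\n",
# as in the string literals used in the evaluation script below.
def build_prompt(article, summary=None):
    prompt = "Summarise the article:\\n" + article + " |||\\n"
    if summary is not None:
        # Training format appends the reference summary (positive_sample)
        prompt += summary
    return prompt
```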

## Run the Model
```python
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

source_model_id = "NorGLM/NorGPT-369M"
peft_model_id = "NorGLM/NorGPT-369M-summarization-peft"

config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(source_model_id, device_map='balanced')

tokenizer_max_len = 2048
# 'model_max_length' is the current name of the deprecated 'max_len' argument
tokenizer_config = {'pretrained_model_name_or_path': source_model_id,
                    'model_max_length': tokenizer_max_len}
tokenizer = AutoTokenizer.from_pretrained(**tokenizer_config)
tokenizer.pad_token = tokenizer.eos_token

# Load the PEFT adapter weights on top of the base model
model = PeftModel.from_pretrained(model, peft_model_id)
```
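
With the model and tokenizer loaded as above, a single article can be summarised with the inference prompt. A minimal sketch (the article text and generation settings are illustrative assumptions, not from the original card):

```python
# Minimal generation sketch using the model and tokenizer loaded above.
article = "..."  # a Norwegian news article (placeholder)

prompt = "Summarise the article:\\n" + article + " |||\\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)

text = tokenizer.decode(output[0], skip_special_tokens=True)
# The generated summary follows the "|||\n" separator
print(text.split("|||\\n")[-1])
```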

## Evaluation on test set
Load the model to evaluate on the test set of the [NO-CNN-DailyMail](https://huggingface.co/datasets/NorGLM/NO-CNN-DailyMail) dataset:
```python
import pandas as pd
import torch
from datasets import load_dataset

torch_device = "cuda" if torch.cuda.is_available() else "cpu"

def generate_texts(model, tokenizer, prompts, max_seq_length=200):
    # prompts is a list of news articles
    results = []
    for prompt in prompts:
        # Skip articles that are too long for the model's context window
        if len(prompt.split()) > 1024:
            results.append('')
            continue

        prompt = 'Summarise the article:\\n' + prompt + ' |||\\n'

        model_inputs = tokenizer(prompt, return_tensors='pt').to(torch_device)
        output = model.generate(**model_inputs, do_sample=False, max_new_tokens=max_seq_length)
        result = tokenizer.decode(output[0], skip_special_tokens=True)
        # Keep only the text after the separator, i.e. the generated summary
        result = result.split("|||\\n")[-1]
        results.append(result)
    return results

print("--LOADING EVAL DATA---")
eval_data = load_dataset("NorGLM/NO-CNN-DailyMail", data_files="test.csv")
prompts = eval_data['train']['article']
positive_samples = eval_data['train']['positive_sample']

print("--MAKING PREDICTIONS---")
model.eval()

output_file = "<output file name>"  # replace with your output path
with torch.no_grad():
    results = generate_texts(model, tokenizer, prompts)

df = pd.DataFrame({'article': prompts, 'generated_text': results, 'positive_sample': positive_samples})

print("Save results to csv file...")
df.to_csv(output_file)
```
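
The card does not name an evaluation metric; ROUGE is a common choice for summarization. A sketch scoring the saved predictions with the Hugging Face `evaluate` library (an assumption, not part of the original card):

```python
# ROUGE scoring sketch; requires `pip install evaluate rouge_score`.
# Assumption: the original card does not specify a metric.
import evaluate
import pandas as pd

df = pd.read_csv("<output file name>")  # the file written by the script above
df = df[df["generated_text"].notna() & (df["generated_text"] != "")]  # drop skipped rows

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=df["generated_text"].tolist(),
                       references=df["positive_sample"].tolist())
print(scores)
```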

## Note
More training details will be released soon!