---
license: cc-by-nc-sa-4.0
language:
- 'no'
---

# Model Card

NorGPT-3B-Instruction-peft is trained on top of the [NorGPT-3B](https://huggingface.co/NorGLM/NorGPT-3B) model on the [NO-Alpaca](https://huggingface.co/datasets/NbAiLab/norwegian-alpaca) dataset.

Prompt format:
```
{instruction} {input} : {output}
```

Inference prompt:
```
{instruction} {input} :
```
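
As an illustration, a hypothetical instruction/input pair (made up for this card, not taken from the dataset) would be rendered like this at inference time:
```
Oversett følgende setning til engelsk. Jeg liker å lese bøker. : 
```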

## Run the Model
```python
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

source_model_id = "NorGLM/NorGPT-3B"
peft_model_id = "NorGLM/NorGPT-3B-Instruction-peft"

config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(source_model_id, device_map='balanced')

tokenizer_max_len = 2048
tokenizer_config = {'pretrained_model_name_or_path': source_model_id,
                    'max_len': tokenizer_max_len}
tokenizer = AutoTokenizer.from_pretrained(**tokenizer_config)
tokenizer.pad_token = tokenizer.eos_token

model = PeftModel.from_pretrained(model, peft_model_id)
```
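
A minimal generation sketch with the loaded adapter; the prompt string below is a made-up example following the format above (not from the dataset), and the sampling settings mirror those used in the evaluation script further down:
```python
# Hypothetical prompt in the "{instruction} {input} : " format
prompt = "Oversett følgende setning til engelsk. Jeg liker å lese bøker. : "

# Move inputs to the same device as the (possibly sharded) model
device = next(model.parameters()).device
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=100, do_sample=True,
                            top_p=0.92, no_repeat_ngram_size=2,
                            pad_token_id=tokenizer.eos_token_id)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```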

## Inference Example
Load the model to evaluate on the last 20% of the NO-Alpaca dataset:
```python
import json

from datasets import load_dataset
from transformers import set_seed

torch_device = "cuda" if torch.cuda.is_available() else "cpu"

def merge_columns(example):
    # Build the prompt in the same "{instruction} {input} : " format used for training
    if str(example["input"]) == "":
        example["text"] = str(example["instruction"]) + " : "
    else:
        example["text"] = str(example["instruction"]) + " " + str(example["input"]) + " : "
    return example

def generate_text(text, max_length=200, do_sample=True, top_p=0.92, top_k=0):
    set_seed(42)
    model_inputs = tokenizer(text, return_tensors='pt').to(torch_device)
    output = model.generate(**model_inputs, max_new_tokens=max_length, do_sample=do_sample,
                            top_p=top_p, top_k=top_k, no_repeat_ngram_size=2,
                            pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print("--LOADING EVAL DATA---")
eval_data = load_dataset("NbAiLab/norwegian-alpaca", split='train[-20%:]')
eval_data = eval_data.map(merge_columns)

print("--MAKING PREDICTIONS---")
model.eval()

output_file = <output file name>  # path to write the generated outputs
with open(output_file, 'w', encoding='utf-8-sig') as file:
    generated_text = []

    for question in eval_data['text']:
        generated_text.append({"generated_text": generate_text(question)})
        print({"text_generated": len(generated_text)})

    json_lines = [json.dumps(data) for data in generated_text]
    json_data = "\n".join(json_lines)
    file.write(json_data)
```
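
The generated predictions can then be read back for inspection or downstream evaluation; a small sketch assuming the JSON-lines file written above:
```python
# Read the JSON-lines predictions back (assumes output_file from the script above)
with open(output_file, encoding='utf-8-sig') as f:
    predictions = [json.loads(line) for line in f if line.strip()]

print(len(predictions), "predictions loaded")
print(predictions[0]["generated_text"])
```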

## Note
More training details will be released soon!