---
tags:
- axolotl
- code
- coding
- Aguila
model-index:
- name: edumunozsala/aguila-7b-instructft-bactrian-x
  results: []
license: apache-2.0
language:
- es
datasets:
- MBZUAI/Bactrian-X
pipeline_tag: text-generation
---

# Aguila 7B SFT model for Spanish 👩‍💻

**Aguila 7B** supervised instruction fine-tuned on the [Bactrian-X dataset](https://github.com/mbzuai-nlp/Bactrian-X) using the **Axolotl** library in 4-bit with the [PEFT](https://github.com/huggingface/peft) library.

## Pretrained model description

[Aguila-7B](https://huggingface.co/projecte-aina/aguila-7b)

Ǎguila-7B is a transformer-based causal language model for Catalan, Spanish, and English. It is based on the Falcon-7B model and has been trained on a 26B-token trilingual corpus collected from publicly available corpora and crawlers.

More information is available in the following Medium.com post: [Introducing Ǎguila, a new open-source LLM for Spanish and Catalan](https://medium.com/@mpamies247/introducing-a%CC%8Cguila-a-new-open-source-llm-for-spanish-and-catalan-ee1ebc70bc79)

## Training data

[MBZUAI/Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X)

The Bactrian-X dataset is a collection of 3.4M instruction-response pairs in 52 languages, obtained by translating 67K English instructions (alpaca-52k + dolly-15k) into 51 languages with the Google Translate API. The translated instructions were then fed to ChatGPT (gpt-3.5-turbo) to obtain natural responses, resulting in 3.4M instruction-response pairs across 52 languages (52 languages x 67K instances = 3.4M instances).

Here we only use the Spanish split of the dataset.
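
As a minimal sketch, the Spanish split can be loaded with the `datasets` library (the per-language configuration name `"es"` is an assumption based on the dataset's language codes):

```py
from datasets import load_dataset

# Load the Spanish portion of Bactrian-X; the "es" config name is assumed
bactrian_es = load_dataset("MBZUAI/Bactrian-X", "es", split="train")
print(bactrian_es[0]["instruction"])
```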

### Training hyperparameters

The following `axolotl` configuration was used during training:

```yaml
base_model: projecte-aina/aguila-7b
# required by falcon custom model code: https://huggingface.co/tiiuae/falcon-7b/tree/main
trust_remote_code: true
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
is_falcon_derived_model: true
load_in_8bit: false
# enable 4bit for QLoRA
load_in_4bit: true
gptq: false
strict: false

push_dataset_to_hub:
datasets:
  - path: edumunozsala/Bactrian-X-es-50k
    type: alpaca
dataset_prepared_path:
val_set_size: 0.05
# enable QLoRA
adapter: qlora
lora_model_dir:
sequence_len: 2048
max_packed_sequence_len:

# hyperparameters from QLoRA paper Appendix B.2
# "We find hyperparameters to be largely robust across datasets"
lora_r: 64
lora_alpha: 16
# 0.1 for models up to 13B
# 0.05 for 33B and 65B models
lora_dropout: 0.05
# add LoRA modules on all linear layers of the base model
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

output_dir: ./qlora-out

# QLoRA paper Table 9
# - 16 for 7b & 13b
# - 32 for 33b, 64 for 64b
# Max size tested on A6000
# - 7b: 40
# - 40b: 4
# decrease if OOM, increase for max VRAM utilization
micro_batch_size: 4
gradient_accumulation_steps: 2
num_epochs: 2
# Optimizer for QLoRA
optimizer: paged_adamw_32bit
torchdistx_path:
lr_scheduler: cosine
# QLoRA paper Table 9
# - 2e-4 for 7b & 13b
# - 1e-4 for 33b & 64b
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true
gradient_checkpointing: true
# stop training after this many evaluation losses have increased in a row
# https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
# early_stopping_patience: 3
resume_from_checkpoint:
auto_resume_from_checkpoints: true
local_rank:
logging_steps: 200
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 10
evals_per_epoch: 1
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.000001
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|endoftext|>"
  bos_token: "<|endoftext|>"
  eos_token: "<|endoftext|>"
```
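
With axolotl 0.4.0, a config like this is typically launched with `accelerate launch -m axolotl.cli.train config.yml`. For orientation, here is a rough sketch of how the quantization and LoRA settings above map onto the `transformers` and `peft` APIs; the actual wiring inside axolotl differs, so treat this as an illustration, not the library's code:

```py
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# load_in_4bit: true -> 4-bit quantization through bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# lora_r / lora_alpha / lora_dropout taken from the config above;
# lora_target_linear: true means adapters on all linear layers
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```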

### Framework versions
- torch==2.1.2
- flash-attn==2.5.0
- deepspeed==0.13.1
- axolotl==0.4.0

### Example of usage

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "edumunozsala/aguila-7b-instructft-bactrian-x"

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, torch_dtype=torch.float16,
                                             device_map="auto", trust_remote_code=True)

# "Think of a solution to reduce traffic congestion."
instruction = "Piense en una solución para reducir la congestión del tráfico."
input_text = ""

prompt = f"""### Instrucción:
{instruction}

### Entrada:
{input_text}

### Respuesta:
"""

input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
with torch.inference_mode():
    outputs = model.generate(input_ids=input_ids, max_new_tokens=256, do_sample=True, top_p=0.9, temperature=0.3)

print(f"Prompt:\n{prompt}\n")
print(f"Generated response:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
```
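
As a possible alternative, the same objects can be wrapped in a `text-generation` pipeline (a sketch reusing the variables from the example above):

```py
from transformers import pipeline

# Reuses `model`, `tokenizer`, and `prompt` from the previous snippet
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = generator(prompt, max_new_tokens=256, do_sample=True, top_p=0.9, temperature=0.3)
print(result[0]["generated_text"][len(prompt):])
```

Note that `do_sample=True` with `top_p=0.9` and `temperature=0.3` makes generation stochastic, so repeated runs will produce different responses.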

### Citation

```
@misc{edumunozsala_2024,
  author    = { {Eduardo Muñoz} },
  title     = { aguila-7b-instructft-bactrian-x },
  year      = 2024,
  url       = { https://huggingface.co/edumunozsala/aguila-7b-instructft-bactrian-x },
  publisher = { Hugging Face }
}
```