---
language:
- ko
tags:
- llama-2
- instruct
- instruction
pipeline_tag: text-generation
license: llama2
datasets:
- squarelike/OpenOrca-gugugo-ko
---

# Llama-2-ko-OpenOrca-gugugo-13B

This model was trained for PoC purposes, as part of an experiment to check whether model performance actually improves when fine-tuning on a large dataset of about 1 million samples.

[Note] Many people/customers still hold the mistaken belief that "more data is always better," so I demonstrated this point directly with experimental data.
In fine-tuning, data quality is far more important than simply preparing a lot of data, and the keyword distribution within the dataset also matters!

For example, searching the kkullm dataset for the "process" and "comparison" keywords shows that each accounts for only about 1% of the entire dataset.
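As a rough illustration (not part of the original experiment), a keyword-distribution check like this can be done with the Hugging Face `datasets` library. The dataset ID and the English keywords below are placeholders, and the way text fields are gathered is an assumption; adjust them to the dataset you are inspecting.

```python
# Sketch only: estimate how often given keywords appear in an instruction
# dataset, to get a feel for keyword distribution before fine-tuning.
from datasets import load_dataset

# Placeholder dataset ID and keywords -- swap in the dataset you are inspecting.
ds = load_dataset("squarelike/OpenOrca-gugugo-ko", split="train")
keywords = ["process", "comparison"]

counts = {k: 0 for k in keywords}
for example in ds:
    # Join all string fields so no assumption about specific column names is needed.
    text = " ".join(v for v in example.values() if isinstance(v, str))
    for k in keywords:
        if k in text:
            counts[k] += 1

total = len(ds)
for k, c in counts.items():
    print(f"{k}: {c} samples ({c / total:.2%} of the dataset)")
```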

### Model Details

- Base Model: [beomi/llama-2-koen-13b](https://huggingface.co/beomi/llama-2-koen-13b)

### Datasets

Trained on 1 million samples from the dataset below. The training infrastructure was two AWS g5.12xlarge instances (8 NVIDIA A10G GPUs in total).

- [OpenOrca-gugugo-ko](https://huggingface.co/datasets/squarelike/OpenOrca-gugugo-ko)
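The card does not describe how the 1-million-sample subset was drawn; a minimal sketch of one way to do it with the `datasets` library is shown below (the shuffle seed and the use of the `train` split are assumptions).

```python
# Sketch only: draw a 1M-sample subset of the dataset for fine-tuning.
from datasets import load_dataset

ds = load_dataset("squarelike/OpenOrca-gugugo-ko", split="train")
subset = ds.shuffle(seed=42).select(range(min(1_000_000, len(ds))))
print(f"Selected {len(subset)} of {len(ds)} samples")
```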

### Hyperparameters

The hyperparameters are simply heuristic values, provided for reference only:

```python
learning_rate = 3e-5
lr_scheduler = "constant_with_warmup"
batch_size = 1
gradient_accumulation_steps = 8
lora_alpha = 16
lora_r = 16
lora_dropout = 0.1
lora_target_modules = "[gate_proj, down_proj, up_proj, q_proj, k_proj, o_proj, v_proj]"
use_flash_attention_2 = True
```
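For context, here is a minimal sketch of how these values could be wired into a LoRA fine-tuning setup with `transformers` and `peft`. This is not the author's actual training script: the precision, epoch count, output path, and trainer wrapper are assumptions, and the flash-attention flag is passed via the newer `attn_implementation` argument.

```python
# Illustrative sketch only: map the hyperparameters above onto a LoRA setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base_model_id = "beomi/llama-2-koen-13b"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # corresponds to use_flash_attention_2
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["gate_proj", "down_proj", "up_proj",
                    "q_proj", "k_proj", "o_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./llama-2-ko-openorca-gugugo-13b",  # placeholder path
    learning_rate=3e-5,
    lr_scheduler_type="constant_with_warmup",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,  # assumption; the card does not state the epoch count
    bf16=True,
)
# A Trainer / SFTTrainer would then consume `model`, `training_args`,
# and the tokenized 1M-sample dataset.
```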

### License

- Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License, under the LLAMA 2 COMMUNITY LICENSE AGREEMENT

This model was created as a personal experiment, unrelated to the organization I work for. |