---
language:
- vie
pipeline_tag: text-generation
Trained: Fine-tuning
Config file: 2.7B
---

# Model Card for GPT-NeoX-2.7B-Vietnamese-finetune

This model was pre-trained and then fine-tuned on Vietnamese text. It is based on GPT-NeoX, a large language model developed by EleutherAI.

## Model Details

### Training Data

- **Pre-train:**

CulturaX Vietnamese Dataset (450 GB) + AI-Hub Vietnamese Dataset (1.3 GB) + Crawled Vietnamese Wikipedia Dataset (630 MB) + viwik18 Dataset (1.27 GB)

- **Fine-tuning:**

12 MB Vietnamese Question & Answer dataset

Vietnamese Alpaca (16,412 rows) + Vietnamese QA Dataset based on viwik18 (14,293 rows)


### Training Hardware

Trained on an A100 40GB GPU and a 48-core CPU. Training took 18 hours to reach 10 epochs.

### Hyperparameters

<figure style="width:30em">

| Hyperparameter         | Value       |
| ---------------------- | ----------- |
| num_train_epochs       | 2670182400  |
| train_batch_size       | 2           |
| learning_rate          | 0.0001      |
| warmup_steps           | 1000        |
| weight_decay           | 0           |

</figure>

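
To illustrate how `warmup_steps` and `learning_rate` from the table interact, here is a minimal sketch of a warmup schedule. The card does not state the scheduler shape, so the linear ramp and the constant rate after warmup are assumptions, and `warmup_lr` is a hypothetical helper rather than part of the training code.

```python
def warmup_lr(step: int, peak_lr: float = 1e-4, warmup_steps: int = 1000) -> float:
    """Learning rate at `step`, assuming a linear warmup to `peak_lr`.

    Assumption: the rate ramps linearly from 0 over `warmup_steps` and is
    then held constant. Any decay used in the real training run is unknown
    and not modeled here.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps


# Print the assumed learning rate at a few points in training.
for step in (0, 250, 500, 1000, 5000):
    print(step, warmup_lr(step))
```

Under these assumptions the rate reaches the tabulated `learning_rate` of 0.0001 at step 1000 and stays there.
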
### How to use

The model can be loaded with the `AutoModelForCausalLM` class:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Download the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("eunyounglee/GPT-NeoX-2.7B-Vietnamese-finetune")
model = AutoModelForCausalLM.from_pretrained("eunyounglee/GPT-NeoX-2.7B-Vietnamese-finetune")
```

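
Once the model and tokenizer are loaded, text can be generated with the standard `generate` method. The following is a minimal sketch, not a recommended configuration: `generate_text` is a hypothetical helper, the prompt and `max_new_tokens` value are illustrative, and greedy decoding is an assumption since the card does not specify decoding settings.

```python
def generate_text(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy-decode a continuation of `prompt` and return the full text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        do_sample=False,  # assumption: greedy decoding
        pad_token_id=tokenizer.eos_token_id,  # reuse EOS when no pad token is set
    )
    # The output contains the prompt tokens followed by the new tokens.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def main() -> None:
    # Imported here so the helper above can be reused without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = "eunyounglee/GPT-NeoX-2.7B-Vietnamese-finetune"
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    model.eval()
    # Illustrative prompt ("Vietnam is"); replace with your own text.
    print(generate_text(model, tokenizer, "Việt Nam là"))


# main()  # uncomment to download the weights and run generation
```

Calling `main()` downloads the full 2.7B-parameter checkpoint, so it is left commented out here.
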