---
datasets:
- zetavg/ShareGPT-Processed
- zetavg/coct-en-zh-tw-translations-twp-300k
- zetavg/zh-tw-wikipedia
- zetavg/tw-sinica-corpus-word-frequency
- RyokoAI/ShareGPT52K
language:
- zh
- en
---

# TW-Pythia-6.9B-Chat

**Taiwanese Mandarin Pythia Language Model, instruction-tuned for dialogue.**

Version 0.2

## Model Details

The TW-Pythia model is derived from the Apache-2.0-licensed [Pythia](https://github.com/EleutherAI/pythia) language model, with 8000 new Traditional Chinese tokens added and its embedding layers resized and retrained.
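As a rough illustration of what the added vocabulary changes, the sketch below compares how the base Pythia tokenizer and the extended tokenizer split a short Traditional Chinese string. The repo id `zetavg/tw-pythia-6.9b-chat` is a placeholder for wherever this model is actually hosted, not something this card pins down.

```python
# Compare tokenization of Traditional Chinese text before and after the
# vocabulary extension. "zetavg/tw-pythia-6.9b-chat" is a placeholder repo id.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("EleutherAI/pythia-6.9b")
extended = AutoTokenizer.from_pretrained("zetavg/tw-pythia-6.9b-chat")  # placeholder

text = "臺灣的語言模型"
print(len(base.tokenize(text)), "tokens with the base tokenizer")
print(len(extended.tokenize(text)), "tokens with the extended tokenizer")
print(len(extended) - len(base), "added vocabulary entries")  # roughly 8000
```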
### Basics

- **Developed by:** [@zetavg](https://github.com/zetavg), based on [EleutherAI](https://www.eleuther.ai/)'s [Pythia](https://github.com/EleutherAI/pythia) language model
- **Model type:** Transformer-based GPT-NeoX causal language model
- **Languages:** English, Traditional Chinese
- **License:** Unknown, pending confirmation of the licenses of the training data
- **Derived from model:** [EleutherAI/pythia-6.9b](https://huggingface.co/EleutherAI/pythia-6.9b)

### Model Sources

- **Repository:** https://github.com/zetavg/twlm
- **Demo:** https://hackmd.io/@z/twlm-demo

## Uses

Without further training, this model has not yet demonstrated practical value for general Traditional Chinese processing, but it does possess some basic Chinese-English translation capability.
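If you still want to experiment with it, the usual `transformers` loading and generation flow applies. The sketch below is a minimal example; both the repo id and the prompt wording are assumptions rather than anything this card specifies.

```python
# Minimal generation sketch. The repo id and the prompt format are assumptions;
# adapt them to the actual checkpoint and prompt template you use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zetavg/tw-pythia-6.9b-chat"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Translate the following sentence into Traditional Chinese: The weather is nice today."  # assumed prompt style
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```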
## Training Details

### Training Data

* 200k [English ↔ Traditional Chinese sentences from the COCT database](https://huggingface.co/datasets/zetavg/coct-en-zh-tw-translations-twp-300k).
* ~8k mixed English and Traditional Chinese [ShareGPT conversations](https://huggingface.co/datasets/zetavg/ShareGPT-Processed).
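Both datasets are on the Hugging Face Hub, so a sketch like the following can pull them. The split names and column layouts are not documented in this card, so the prints simply show whatever each dataset exposes.

```python
# Pull the two training datasets from the Hugging Face Hub and inspect them.
from datasets import load_dataset

coct = load_dataset("zetavg/coct-en-zh-tw-translations-twp-300k")
sharegpt = load_dataset("zetavg/ShareGPT-Processed")

print(coct)      # available splits and columns
print(sharegpt)
```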
### Training Procedure

First, we build a BPE tokenizer based on the original Pythia tokenizer with 8000 new Traditional Chinese tokens added.
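The card does not say how the 8000 tokens were selected, so the sketch below only shows the mechanical step of extending the base tokenizer with an illustrative token list via `add_tokens`; the project's actual BPE construction may differ.

```python
# A minimal sketch of extending the Pythia tokenizer, assuming the list of new
# Traditional Chinese tokens has already been chosen. add_tokens() is used here
# as a simple mechanism; the project's actual BPE construction may differ.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-6.9b")

new_tokens = ["臺灣", "語言", "模型"]  # illustrative subset of the ~8000 new tokens
num_added = tokenizer.add_tokens(new_tokens)
print(f"added {num_added} tokens, new vocab size = {len(tokenizer)}")

tokenizer.save_pretrained("tw-pythia-tokenizer")  # hypothetical output path
```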
Then, we resize the embedding layers of the `pythia-6.9b` model to accommodate the new vocabulary size, and we train only the input/output embedding layers so the model can learn the new Traditional Chinese words and phrases.
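A sketch of this resize-and-freeze step, reusing the hypothetical extended-tokenizer path from the previous sketch: every weight is frozen except the GPT-NeoX input embedding (`gpt_neox.embed_in`) and output head (`embed_out`).

```python
# Resize the embeddings to the extended vocabulary, then make only the input
# embedding and the output head trainable. "tw-pythia-tokenizer" is the
# hypothetical path saved in the previous sketch.
from transformers import AutoTokenizer, GPTNeoXForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tw-pythia-tokenizer")
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-6.9b")

model.resize_token_embeddings(len(tokenizer))  # new rows are randomly initialized

for param in model.parameters():
    param.requires_grad = False
for param in model.gpt_neox.embed_in.parameters():
    param.requires_grad = True
for param in model.embed_out.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")  # only the two embedding matrices
```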
Finally, LoRA weights are added to the model and fine-tuned for instruction following.
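A sketch of the LoRA stage with the PEFT library: the GPT-NeoX attention projection is named `query_key_value`, but the rank, alpha, and dropout values below are illustrative defaults, not the values from the linked config.

```python
# LoRA setup sketch with PEFT. r/lora_alpha/lora_dropout are illustrative;
# the values actually used are in the ta01_p7b.yaml config linked below.
from peft import LoraConfig, get_peft_model
from transformers import GPTNeoXForCausalLM

# In practice this would be the checkpoint with resized, retrained embeddings.
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-6.9b")

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # GPT-NeoX attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```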
#### Training Hyperparameters

- **Training regime:** `fp32`
- **Full config:** https://github.com/zetavg/twlm/blob/main/configs/ta01_p7b.yaml

### Hardware

* 1× H100 80 GB GPU on Lambda Cloud (provisioned with SkyPilot), about 20 hours in total.