Traditional Chinese LLM Corpus
Traditional Chinese corpus collection for LLM training (pre-training, instruction-tuning, and RLHF/alignment).
Viewer • Updated • 1.78M • 1 • 3
Note: Contains ~2B tokens from a high-quality corpus. Cleaned and deduplicated.
liswei/wikipedia-zhtw-dedup
Viewer • Updated • 1.18M • 15
Note: Deduplicated version of erhwenkuo/wikipedia-zhtw, using MinHash.
liswei/c4-zhtw
Viewer • Updated • 4.86M • 4
Note: Deduplicated zh-TW subset of C4 (the Colossal Clean Crawled Corpus, a cleaned version of Common Crawl).
liswei/common-crawl-zhtw
Viewer • Updated • 2.71M • 3 • 1
Note: Deduplicated zh-TW subset of Common Crawl (CC).
zetavg/CC-100-zh-Hant-merged
Viewer • Updated • 12.3M • 4 • 2
Note: zh-TW subset of the CC-100 dataset, which is derived from Common Crawl. Caveat: CC data harms performance, as shown in TaiwanLlama.
liswei/coct-en-zhtw-dedup
Viewer • Updated • 217k • 3
Note: Deduplicated version of zetavg/coct-en-zh-tw-translations-twp-300k. zh-TW <-> en paired articles provided by 台灣光華雜誌 (Taiwan Panorama).
liswei/PromptPair-TW
Viewer • Updated • 119k • 2 • 1
Note: Traditional Chinese instruction dataset. Contains en <-> zh-TW pairs with system prompts to better adapt from English pre-trained models.
yentinglin/TaiwanChat
Viewer • Updated • 485k • 1.41k • 50
Note: Instruction dataset used to train TaiwanLLM v1. Find more details in the paper.
benchang1110/Chattw_v2
Viewer • Updated • 164k • 1
Note: Shares the same set of user prompts as yentinglin/TaiwanChat. The prompts are probably translated from Alpaca, ShareGPT, and similar datasets.
erhwenkuo/alpaca-data-gpt4-chinese-zhtw
Viewer • Updated • 52k • 147 • 5
Note: The alpaca-gpt4 dataset, translated from English to zh-TW.
zetavg/mlqa_en_zh_tw
Viewer • Updated • 3.29k • 4 • 6
Note: zh-CN/en multilingual QA translated to zh-TW/en. Internal experiments show that, when transferring from an English base model, training on Q:en->A:zh pairs (or vice versa) improves SFT performance.
zetavg/ShareGPT-Processed
Viewer • Updated • 90.7k • 16 • 29
Note: The RyokoAI/ShareGPT52K dataset, converted to Markdown and labeled with the language used.
benchang1110/PTT_QA
Updated • 1
lchakkei/OpenOrca-Traditional-Chinese
Viewer • Updated • 4.23M • 11 • 8
Note: Instruction data machine-translated from English with Google Translate.
Heng666/Traditional_Chinese-aya_dataset
Viewer • Updated • 4.91k • 1
Heng666/Traditional_Chinese-aya_evaluation_suite
Viewer • Updated • 650 • 8 • 2
ChenWeiLi/Med_Breexe_zhtw
Updated • 3
Note: Instruction dataset in the medical domain. Prompts are translated, then fed to the Breexe model.
Tarklanse/Traditional_Chinese_roleplay_chat_Dataset
Viewer • Updated • 9.51k • 20 • 31
DataAgent/Pretrain-Taiwan-DentistKnowledge-zhTW-290K
Viewer • Updated • 147 • 3 • 1
KSmart/chinese_traditional_chengyu
Viewer • Updated • 111 • 2
Note: This is in Simplified Chinese.
liswei/rm-static-zhTW
Viewer • Updated • 81.4k • 30
Note: Preference dataset with chosen/rejected pairs. Translated using m2m100.
ZoneTwelve/ChineseGrammaticalErrorEvaluation
Viewer • Updated • 132
ZoneTwelve/micro_sft_instruct
Viewer • Updated • 10 • 3
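
Several notes in this collection mention MinHash deduplication. The collection does not say how it was implemented, so the following is only a minimal sketch of the general technique, not the pipeline used for these datasets. The function names, the character 3-gram shingling, the 64 hash functions, and the 0.8 threshold are all illustrative assumptions.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of character n-grams (works for Chinese without a tokenizer)."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash_signature(text, num_perm=64, n=3):
    """Build a MinHash signature: for each salted hash, keep the minimum over all shingles."""
    grams = shingles(text, n)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt simulates a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Keep a document only if it is not a near-duplicate of one already kept.

    This compares every pair, which is quadratic; it is for illustration only.
    Large corpora typically add locality-sensitive hashing to avoid that cost.
    """
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, s) < threshold for s in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


docs = ["台北是臺灣的首都", "台北是臺灣的首都", "completely different text"]
unique_docs = deduplicate(docs)
```

Here the exact repeat of the first document is dropped and the unrelated one is kept. Lowering the threshold removes looser near-duplicates at the risk of discarding distinct documents.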