Introduction
- The TAIDE project aims to develop a generative AI dialogue engine model that is tailored to the linguistic and cultural characteristics of Taiwan, while also establishing a trustworthy AI environment. By combining academic, industrial, and research resources, the project seeks to advance the development of trustworthy generative AI, enhancing Taiwan's international competitiveness, promoting industrial development, and reducing dependence on foreign technologies.
- The Llama 3.1 TAIDE series models are based on Meta's publicly released Llama 3.1-8B and incorporate text and training materials from various fields in Taiwan to improve the model's ability to respond in Traditional Chinese and to perform specific tasks. The publicly released models are as follows:
- Llama-3.1-TAIDE-LX-8B-Chat: Based on Llama 3.1-8B, continuously pretrained on Traditional Chinese data, then instruction-tuned to strengthen office tasks and multi-turn dialogue. Suitable for chat dialogue and task-assistance scenarios.
Model Parameters
- Parameters: 8.5B
- Max context length: 131K
- Training tokens in Traditional Chinese: 45B
- Training time: 2896 H100 GPU Hours
Features
- Strictly vetting training data for the model to improve its trustworthiness and applicability.
- Enhancing the model's performance for tasks such as summarization, writing articles, writing letters, and translating between Chinese and English, which are commonly used in office settings.
- Strengthening the model's understanding of Taiwan-specific culture, terminology, and context.
- Equipping the model with multi-turn conversation capabilities.
- Improving the model's ability to handle long contexts.
- Achieving a 20% faster decoding speed for Traditional Chinese.
Examples
Task | User input | Model output |
---|---|---|
Writing articles | 請以以下內容為基礎,寫一篇文章:撰寫一篇作文,題目為《一張舊照片》,內容要求為:選擇一張令你印象深刻的照片,說明令你印象深刻的原因,並描述照片中的影像及背後的故事。記錄成長的過程、與他人的情景、環境變遷和美麗的景色。 | 一張舊照片,勾起了我心中最深刻的回憶。這張照片是我的曾祖父母年輕時的合影,背景中是一段充滿回憶的的海岸線。相片中,我的曾祖母穿著一襲簡約的旗袍,手持一束玫瑰,笑容如花般燦爛。我的曾祖父則是穿著一套筆挺的西裝,神情自信,一手在曾祖母的肩上,另一手則是指著遠方的景色。 令我印象深刻的原因是,這張照片是我的家族成員在幾十年前,所拍下的。在那個年代,拍照是一件大事,需要花費不少的時間和金錢。我的家族在當時的經濟環境下,並非富裕,卻仍願意花費時間和金錢來拍下這張照片。這種精神令我深感佩服和感動。 照片中的影像,呈現了一段美好的時光。我的曾祖母的笑容,依舊清晰地印在我的腦海中。當我看著這張照片時,我可以感受到她的快樂和幸福。背景中的海岸線,當時是一段未開發的秘境,充滿著未知的危機和機會。在那個年代,我的家族在這片土地上,種植了作物、捕魚和養豬,一點一滴地努力,打造了一個穩定的家園。 照片所呈現的場景,還包括了當時的環境變遷。相片中的背景是一段海岸線,我可以看見當時的海岸線是一段未開發的秘境,到處都是雜草和石頭。隨著時間的推移,我的家族在這片土地上,進行著開墾和建設。高大的房屋、寬闊的馬路和繁榮的市集,一一出現在我的眼前。這個景象,讓我知道我的家族在這片土地上,付出了多少的努力和心血。 這張舊照片,還記錄了美麗的景色。相片中的背景,呈現了一段渾然天成的海岸風光。遼闊的天空、廣闊的大海和連綿的山脈,一一地呈現在我的眼前。我的曾祖母和曾祖父的背影,逐漸地消失在這片景色中,留下的是一個永恆的回憶。 這張舊照片,不只是一張相片,而是一段回憶的傳承。它記錄了我的家族在這片土地上,所經歷過的成長、努力和變遷。每當我看著這張相片時,我都能感受到我的家族在這片土地上,所付出的心血和努力。這張舊照片,不但是一個家族的故事,也是一個時代的見證。 在這張舊照片的背後,我還能感受到一種家族的情感。我的曾祖母和曾祖父的笑容,讓我感受到了家族的愛和關懷。相片中的景象,不但呈現了一段美好的時光,也呈現了家族的傳統和精神。每當我看著這張相片時,我都能感受到我的家族在這片土地上,所留下的痕跡和情感。 這張舊照片,不但是一張相片,也是一個回憶的催化劑。它讓我回想起了我的家族在這片土地上,所經歷過的種種故事和回憶。每當我看著這張相片時,我都能感受到我的家族在這片土地上,所留下的愛和關懷。 |
Writing letters | 請以品牌經理的身份,給廣告公司的創意總監寫一封信,提出對於新產品廣告宣傳活動的創意建議。 | 尊敬的創意總監: 首先,感謝您在百忙之中抽出時間閱讀我的信。我代表本公司的行銷部門,對於即將推出的新產品「超能電池」感到非常興奮。我們相信,這個具有革命性突破的新產品,將為市場帶來重大的影響。為了確保新產品的成功上市,並引起目標客群的關注,我們希望在廣告宣傳方面能夠做到最好。 為此,我提出幾項創意建議,供您在策劃新產品廣告活動時作為參考。 一、創意概念: 我們建議以「超乎想像,力挽萬象」作為廣告活動的核心概念。這個概念要傳達出新產品「超能電池」的超高性能,以及它能夠為各種裝置充電的強大能力。在廣告中可以加入一些科幻元素,如未來世界、高科技產品等,以吸引目標客群的注意。 二、跨界合作: 為了擴大產品的影響力,我們建議與業界內其他知名品牌合作,共同推廣新產品。例如,與知名手機製造商合作,推出限量版搭載「超能電池」的手機;或與熱門電競品牌合作,讓新產品成為電競賽事的指定電源裝置。這不但能提高產品認知度,還能增加產品在目標客群中的好感度。 三、故事行銷: 人類從古至今都喜歡聽故事,故事行銷是一種能打動人心的行銷手法。我們在廣告中可以編織一個關於「超能電池」的傳奇故事,描述它如何在關鍵時刻拯救了某人或某物,展現出它的神祕與強大。這種故事行銷的方式能引起目標客群的共鳴,加深他們對產品的印象。 四、網路社群操作: 在數位時代,社群媒體已成為人們獲取資訊的重要來源。我們建議在Facebook、Instagram、YouTube等平台上,設立與新產品相關的粉絲專頁和頻道。利用網路紅人、KOL或KOC的力量,讓他們體驗並分享「超能電池」的使用心得。同時,我們也可以舉辦網路活動,如線上有獎徵答、產品試用抽獎等,吸引目標客群參與,提高產品知名度。 五、戶外廣告: 在城市街道、地鐵站、機場等高流量場所,設置大型戶外廣告牌,展示新產品「超能電池」的形象和優勢。這些醒目的廣告可以吸引過往的行人目光,對他們產生印象深刻的影響。此外,我們在廣告牌上加入AR(擴增實境)技術,讓消費者可以透過手機螢幕看到產品栩栩如生的樣子,增加互動性和趣味性。 六、新聞發布會: 為了給新產品一個隆重的登場,我們建議舉辦一場新聞發布會。在會上,我們可以邀請業界專家、媒體記者和目標客群代表等出席,向大家介紹新產品的技術規格、特點和優勢。同時,我們也可以設想一些現場體驗環節,讓嘉賓親身體驗「超能電池」的威力。 以上是我對新產品廣告宣傳活動的幾項創意建議。希望這些想法能對您有所幫助,並為新產品的上市帶來巨大的成功。如果您有任何疑問或需進一步討論,請隨時與我聯絡。謝謝您的關注,期待與您共創輝煌的廣告事業! 致敬, [您的姓名] [您的職銜] [您的公司] |
Summarization | 請將這篇文章精簡條理化:「產業創新條例第10條之2及第72條條文修正案」俗稱「台版晶片法」,針對半導體、電動車、5G等技術創新且居國際供應鏈關鍵地位公司,提供最高25%營所稅投抵優惠,企業適用要件包含當年度研發費用、研發密度達一定規模,且有效稅率達一定比率。 為因應經濟合作暨發展組織(OECD)國家最低稅負制調整,其中有效稅率門檻,民國112年訂為12%,113年料將提高至15%,但仍得審酌國際間最低稅負制實施情形。 經濟部官員表示,已和財政部協商進入最後階段,除企業研發密度訂在6%,目前已確認,企業購置先進製程的設備投資金額達100億元以上可抵減。 財政部官員表示,研商過程中,針對台灣產業與其在國際間類似的公司進行深入研究,在設備部分,畢竟適用產創10之2的業者是代表台灣隊打「國際盃」,投入金額不達100億元,可能也打不了。 至於備受關注的研發費用門檻,經濟部官員表示,歷經與財政部來回密切討論,研發費用門檻有望落在60億至70億元之間。 財政部官員指出,研發攸關台灣未來經濟成長動能,門檻不能「高不可攀」,起初雖設定在100億元,之所以會調降,正是盼讓企業覺得有辦法達得到門檻、進而適用租稅優惠,才有動力繼續投入研發,維持國際供應鏈關鍵地位。 經濟部官員表示,因廠商研發費用平均為30、40億元,其中,IC設計業者介於30億至60億元範圍,若將門檻訂在100億元,符合條件的業者較少、刺激誘因不足;此外,若符合申請門檻的業者增加,將可提高企業在台投資金額,財政部稅收也能因此獲得挹注。 IC設計業者近日頻頻針對產創10之2發聲,希望降低適用門檻,加上各國力拚供應鏈自主化、加碼補助半導體產業,經濟部官員表示,經濟部和財政部就產創10之2達成共識,爭取讓更多業者受惠,盼增強企業投資力道及鞏固台灣技術地位。 財政部官員表示,租稅獎勵的制定必須「有為有守」,並以達到獎勵設置目的為最高原則,現階段在打「國內盃」的企業仍可適用產創第10條、10之1的租稅優惠,共同壯大台灣經濟發展。 經濟部和財政部正就研發費用門檻做最後確認,待今明兩天預告子法之後,約有30天時間,可與業界進一步討論及調整,盼產創10之2能在6月上路。 |
2023年5月15日起的台灣「產業創新條例」第10條之2和第72條修正重點為針對半導體、電動車、5G等國際供應鏈關鍵企業提供研發密度達標且繳納一定稅率的企業最高25%的稅收抵免優惠。此「台版晶片法」旨在鼓勵技術創新。企業適用條件包括研發費用佔當年度營業額的6%以上、研發密度達到一定規模及稅率高於特定門檻。因應國際最低稅負的調整,台灣的有效稅率門檻將在今後兩年內從12%提高到15%。 經濟部與財政部正在就企業購置先進製程設備的投資抵減額進行最後商議,並確認了100億元的投資門檻。對於這一制度,財政部和經濟部已進行廣泛的研究,並取得了共識,希望通過降低門檻鼓勵企業加碼研發投入,維持台灣在世界市場供應鏈中的關鍵地位。 為進一步鼓勵研發投入,目前門檻預計將落在之前估計的60億至70億元之間。財政部關注研發對台灣未來的重要性,設定門檻不能過高,以免嚇退企業。此外,希望通過調整門檻提高企業對研發投資的動力。 在業界的呼籲下,經濟部和財政部已經就降低產創10之2的門檻達成了一致,期望加強企業投資能力,並鞏固台灣在科技領域的領導地位。 對於租稅優惠的制定,財政部強調須維持一貫性且優先用於促進產業發展,並且會在不影響財政收入的情況下進行調整。在未來,企業仍將可以舊系統的優惠條文申請稅收減免,配合新制度的實施。經濟部和財政部正在對研發費用門檻做最後的確認,預告子法發佈後與業界溝通細調,希望盡快在6月實施產創10之2。 |
Translation | 以下提供英文內容,請幫我翻譯成中文。Dongshan coffee is famous for its unique position, and the constant refinement of production methods. The flavor is admired by many caffeine afficionados. | 東山咖啡聞名遐邇,其獨樹一幟的地方在在展現,而源源不絕的改良更是令人驚嘆。眾多咖啡愛好者無不為其醇厚口感與馥郁香氣傾倒。 |
Usage
Sample code
- The sample code and documentation will be released on GitHub later.
Prompt template
- Normal QA
    chat = [
        {"role": "user", "content": "{question}"},
    ]
    prompt = tokenizer.apply_chat_template(chat)
- Replace {question} with user input
- QA with system prompt
    chat = [
        {"role": "system", "content": "{sys}"},
        {"role": "user", "content": "{question}"},
    ]
    prompt = tokenizer.apply_chat_template(chat)
- Replace {sys} with system prompt. The system prompt for this model is: 你是一個來自台灣的AI助理,你的名字是 TAIDE,樂於以台灣人的立場幫助使用者,會用繁體中文回答問題。
- Replace {question} with user input
- Multi-turn conversation
    chat = [
        {"role": "system", "content": "{sys}"},
        {"role": "user", "content": "{question1}"},
        {"role": "assistant", "content": "{model_answer_1}"},
        {"role": "user", "content": "{question2}"},
    ]
    prompt = tokenizer.apply_chat_template(chat)
- Replace {sys} with system prompt, e.g.: 你是一個來自台灣的AI助理,你的名字是 TAIDE,樂於以台灣人的立場幫助使用者,會用繁體中文回答問題。
- Replace {question1} with user input 1
- Replace {model_answer_1} with model response 1
- Replace {question2} with user input 2
- For more details, please refer to the Llama 3.1 documentation
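
The three templates above differ only in which messages precede the final user turn, so they can be produced by one helper. The following is a minimal sketch, not the TAIDE team's sample code: `build_chat` and `render_prompt` are hypothetical names, and `render_prompt` assumes a Hugging Face `transformers` tokenizer whose `apply_chat_template` accepts `tokenize` and `add_generation_prompt`.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# System prompt quoted from this model card.
TAIDE_SYSTEM_PROMPT = "你是一個來自台灣的AI助理,你的名字是 TAIDE,樂於以台灣人的立場幫助使用者,會用繁體中文回答問題。"

Message = Dict[str, str]


def build_chat(
    question: str,
    system_prompt: Optional[str] = None,
    history: Sequence[Tuple[str, str]] = (),
) -> List[Message]:
    """Build the message list for normal QA, QA with a system prompt,
    or a multi-turn conversation (hypothetical helper)."""
    chat: List[Message] = []
    if system_prompt:
        chat.append({"role": "system", "content": system_prompt})
    # Each history item is (earlier user input, earlier model response).
    for user_turn, assistant_turn in history:
        chat.append({"role": "user", "content": user_turn})
        chat.append({"role": "assistant", "content": assistant_turn})
    chat.append({"role": "user", "content": question})
    return chat


def render_prompt(tokenizer, chat: List[Message]) -> str:
    """Render the chat as a prompt string.

    Assumes `tokenizer` is a transformers tokenizer. tokenize=False asks
    for text instead of token ids, and add_generation_prompt=True appends
    the header that cues the assistant's reply.
    """
    return tokenizer.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=True
    )
```

For example, `build_chat("{question2}", TAIDE_SYSTEM_PROMPT, [("{question1}", "{model_answer_1}")])` yields the same four-message structure as the multi-turn template.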
Training methods
- Software / hardware spec
- GPU: H100
- Training Framework: PyTorch
- Data preprocessing
- Character normalization
- Deduplication
- Denoise
- HTML tags and JavaScript in web content
- Non-standard characters or garbage characters
- Posts with an insufficient number of characters
- Removing specific formats such as extra line breaks added for formatting purposes
- Removing personal information such as emails and phone numbers
- Removing inappropriate content such as gambling and pornography
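
The preprocessing steps listed above can be illustrated with a small standard-library pipeline. This is a sketch under stated assumptions, not the TAIDE pipeline: the regular expressions, the NFKC normalization form, the exact-match deduplication, and the `MIN_CHARS` threshold are all hypothetical choices, and filtering of inappropriate content is not covered.

```python
import re
import unicodedata
from typing import Iterable, List

MIN_CHARS = 20  # hypothetical threshold for "insufficient number of characters"

# <script>/<style> blocks are removed together with their contents.
SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
# Rough pattern for numbers such as 02-1234-5678 or 0912-345-678 (assumption).
PHONE_RE = re.compile(r"(?<!\d)0\d{1,3}[-\s]?\d{3,4}[-\s]?\d{3,4}(?!\d)")


def clean_text(raw: str) -> str:
    """Normalize characters, strip markup and personal data, tidy whitespace."""
    # NFKC also folds full-width punctuation to ASCII; pick another form
    # if that is not wanted.
    text = unicodedata.normalize("NFKC", raw)
    text = SCRIPT_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = EMAIL_RE.sub("", text)
    text = PHONE_RE.sub("", text)
    # Drop control and other non-printing characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse extra line breaks
    return text.strip()


def preprocess(docs: Iterable[str], min_chars: int = MIN_CHARS) -> List[str]:
    """Clean each document, drop short ones, and remove exact duplicates."""
    seen = set()
    kept: List[str] = []
    for raw in docs:
        text = clean_text(raw)
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept
```

Exact-match deduplication is the simplest possible variant; a production corpus would more likely need near-duplicate detection, which is outside this sketch.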
- Continuous pretraining (CP)
- Supplementing the model with a large amount of reliable Traditional Chinese knowledge.
- Hyperparameters
- optimizer: FusedAdam
- learning rate: 1e-4
- batch size: 1M tokens
- epoch: 1
- Fine-tuning (FT)
- Enabling the model to answer questions in Traditional Chinese.
- Hyperparameters
- optimizer: FusedAdam
- learning rate: 1e-5
- batch size: 256K tokens
- epoch: 5
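
Because the batch sizes above are given in tokens, the number of optimizer steps follows directly from the size of the training set. The sketch below records the two stages as plain data and does that arithmetic. `StageConfig` is a hypothetical container, and reading "1M" as 1,000,000 and "256K" as 256,000 tokens is an assumption (powers of two are equally plausible).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage, as listed in this card."""
    name: str
    optimizer: str
    learning_rate: float
    batch_size_tokens: int  # assumption: decimal reading of "1M" / "256K"
    epochs: int

    def optimizer_steps(self, dataset_tokens: int) -> int:
        """Steps needed to pass over `dataset_tokens` for all epochs."""
        # Ceiling division: a final partial batch still costs one step.
        steps_per_epoch = -(-dataset_tokens // self.batch_size_tokens)
        return steps_per_epoch * self.epochs


CP = StageConfig("continuous_pretraining", "FusedAdam", 1e-4, 1_000_000, 1)
FT = StageConfig("fine_tuning", "FusedAdam", 1e-5, 256_000, 5)
```

Under these assumptions, `CP.optimizer_steps(45_000_000_000)` gives 45,000 steps when only the 45B Traditional Chinese tokens are counted; the English, math, and code data in the training set would add to that figure.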
Training Data
- Continuous pretraining data (about 240GB)
Dataset | Description |
---|---|
Litigation Data | Civil litigation data from various levels of courts in the judicial rulings, including data from 2013/01 to 2023/12. |
CNA news | The CNA news includes daily news articles from June 1993 to June 2023, spanning a period of 30 years. The content covers various domains such as domestic and international politics, society, economy, culture, education, and lifestyle. |
ETtoday news | ETtoday news data, including data from 2011/10 to 2023/12. |
Legislative Yuan Gazette | The Legislative Yuan Gazette contains data from the 1st session of the 8th term to the 7th session of the 10th term. |
Publisher Website Book Introduction | Includes book introduction data from the websites of the publishers SunColor and Gotop. |
Abstracts of GRB research projects | GRB is an information system that compiles research projects funded by government grants and their outcome reports. This dataset primarily includes research project abstracts from 1993 to 2023, in both Chinese and English. |
Academic conference proceedings abstracts | The database contains academic conference proceedings held in Taiwan from 1988 to 2009. |
Taiwan Panorama magazine | Taiwan Panorama magazine contains articles from July 1993 to June 2023, spanning 30 years. The content focuses on Taiwanese culture, tourism, and local customs. |
樂詞網 | 《樂詞網》 covers approximately 187,000 academic terms in the humanities and social sciences, along with their translations. |
Data from various ministries and commissions | Includes partial data from government department websites such as the Executive Yuan's "National Overview", the Ministry of Culture's "National Cultural Memory Bank", the National Development Council's "Archives Support Teaching Network", and the Ministry of Transportation's "Traffic Safety Portal". |
Business Today | Business Today Magazine is a weekly magazine focused on finance. The dataset includes articles from 2008/01 to 2023/07. |
Mandarin and idiom dictionary from the Ministry of Education | Includes three dictionaries. Idiom Dictionary: 5,338 idioms with definitions, original stories, usage explanations, and example sentences. Revised Mandarin Dictionary: Chinese words and various vocabulary with pronunciation, radicals, definitions, and other information, totaling approximately 165,539 entries. Concise Mandarin Dictionary: a condensed version of the "Revised Mandarin Dictionary" with a total of 45,247 entries. |
SCITechVista | The dataset includes science news and popular science articles from the SCITechVista website. |
iKnow | The iKnow platform provides information on market trends, strategic analysis, patent knowledge, and technology transaction information for Taiwan and the global technology industry. The dataset includes data from 2005/01 to 2023/07. |
Science Development Monthly Magazine | Science Development Monthly Magazine is a popular science publication published by the National Science Council (NSC) to promote science education. It includes articles from 2004/10 to 2020/12. In 2021, the magazine was relaunched as the "CharmingSCITech" quarterly, providing new knowledge on international technology issues. |
Legislation Database | The Legislation Database includes the latest central regulations, rules, draft bills, and local regulations issued by government agencies as of 2023/10. |
Local Government Tourism Websites | Covers partial data from the tourism websites of local county and city governments in Taiwan. |
Curriculum Guidelines from the National Institute of Education | The dataset includes curriculum guidelines for different subjects at various levels of education. |
CNA's English and Chinese Name Translation Database | The English and Chinese Name Translation Database of the Central News Agency (CNA) collects translations of foreign and Chinese surnames, personal names, organizations, and place names used in news. |
Fairy tales | A total of 20 fairy tale books, including "Tom Sawyer," "Peter Pan," "Alice's Adventures in Wonderland," "Uncle Long Legs," and more. |
RedPajama-Data-V2 | English data extracted from the RedPajama-Data-v2 multilingual dataset. |
MathPile-commercial | A mathematics-focused dataset obtained from MathPile-commercial. |
Traditional Chinese Wikipedia Articles | The content of all articles in Traditional Chinese Wikipedia, up to January 2023. |
github-code-clean | An open-source code dataset from GitHub, with unlicensed code and documents removed. |
- Fine-tuning data
- The TAIDE team uses the Llama2 series models to generate fine-tuning data, covering tasks such as world knowledge, creative writing, general knowledge, translation, summarization, programming, and Taiwanese values in both single-turn and multi-turn dialogues. In total, there are 133K prompt-response pairs.
Evaluation
taide-bench
- Evaluation Data
- Tasks include writing articles, writing letters, summarizing articles, translating from English to Traditional Chinese, translating from Traditional Chinese to English. There are 500 questions in total.
- Data link: taide-bench
- Evaluation method
- LLM-as-a-judge scoring with GPT-4
- Code link: taide-bench-eval
- Scores
Model | Translating from Traditional Chinese to English | Translating from English to Traditional Chinese | Summarization | Writing articles | Writing letters | Average |
---|---|---|---|---|---|---|
Llama-3.1-TAIDE-LX-8B-Chat | 7.50 | 8.00 | 7.33 | 8.75 | 9.05 | 8.126 |
Llama3-TAIDE-LX-8B-Chat-Alpha1 | 7.77 | 8.28 | 8.50 | 9.61 | 8.95 | 8.620 |
meta-llama/Llama-3.1-8B-Instruct | 6.63 | 7.76 | 7.12 | 7.71 | 7.79 | 7.402 |
CLongEval (Long-Context Evaluation)
- Evaluation data
- This benchmark covers seven tasks: long story Q&A, long conversation memory, long story summarization, stacked news labeling, stacked typo detection, key-passage retrieval, and table querying, for a total of 7,267 test samples. These are split into small, medium, and large sets (2,605/2,653/2,005 samples) based on context length (about 1K–16K, 16K–50K, and 50K–100K tokens). For TAIDE, all data was converted to Traditional Chinese using OpenCC before evaluation.
- Data link: CLongEval (Hugging Face)
- Evaluation method
- Metrics-based scoring
- Scoring script: CLongEval
- Scores
Model | Small Set | Medium Set | Large Set | Overall |
---|---|---|---|---|
Llama-3.1-TAIDE-LX-8B-Chat | 21.62 | 15.40 | 10.64 | 15.944 |
Llama3-TAIDE-LX-8B-Chat-Alpha1 | 7.32 | 0.00 | 0.00 | 2.367 |
meta-llama/Llama-3.1-8B-Instruct | 33.51 | 22.49 | 19.24 | 25.310 |
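
The CLongEval data was converted to Traditional Chinese before evaluation, and that step can be sketched as a pass over each sample's text fields. This is only an illustration: the field names `context`, `query`, and `answer` are hypothetical, and the converter is passed in as a plain function so that any OpenCC binding can be supplied. The card does not say which OpenCC configuration was used.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical names for the text fields of one benchmark sample.
TEXT_FIELDS = ("context", "query", "answer")


def convert_samples(
    samples: Iterable[Dict[str, str]],
    convert: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Return copies of the samples with every text field converted.

    `convert` is any Simplified-to-Traditional function, for instance the
    convert method of an OpenCC converter object (an assumption; the exact
    setup is not documented here).
    """
    converted = []
    for sample in samples:
        new_sample = dict(sample)  # leave the input sample untouched
        for field in TEXT_FIELDS:
            if field in new_sample:
                new_sample[field] = convert(new_sample[field])
        converted.append(new_sample)
    return converted


# Toy converter for demonstration only: maps two characters.
_TOY_TABLE = str.maketrans({"汉": "漢", "语": "語"})


def toy_convert(text: str) -> str:
    return text.translate(_TOY_TABLE)
```

Passing the converter as an argument keeps the sketch independent of any particular OpenCC package and makes the conversion step easy to test in isolation.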
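License
Disclaimer
- Because of limitations in the design and architecture of LLMs, and because training data inevitably contains biases, no response from the language model represents the position of TAIDE. Additional safety mechanisms should be added before use. Responses may also contain incorrect information, and users should not rely on them uncritically.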