Taishi-N324 committed
Commit 2de7372 · 1 Parent(s): 036f4c4
Create README.md

README.md ADDED
@@ -0,0 +1,196 @@
---
language:
- en
- ja
library_name: transformers
pipeline_tag: text-generation
license: llama3.1
model_type: llama
---

# Llama3.1 Swallow

Our Swallow model has undergone continual pre-training from the [Llama 3.1 family](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f), primarily with the addition of Japanese language data. The Instruct versions use supervised fine-tuning (SFT). Links to other models can be found in the Swallow Model Index below.

# Model Release Updates

We are excited to share the release schedule for our latest models:
- **October 08, 2024**: Released the [Llama-3.1-Swallow-8B-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-v0.1), [Llama-3.1-Swallow-8B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1), [Llama-3.1-Swallow-70B-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-v0.1), and [Llama-3.1-Swallow-70B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.1).

## Swallow Model Index

|Model|Llama-3.1-Swallow|Llama-3.1-Swallow-Instruct|
|---|---|---|
|8B| [Link](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-v0.1) | [Link](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1) |
|70B| [Link](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-v0.1) | [Link](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.1) |

![logo](./logo.png)

This repository provides large language models developed by [Swallow-LLM](https://swallow-llm.github.io/).

## Model Details

* **Model type**: Please refer to the [Llama 3.1 MODEL_CARD](https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md) for details on the model architecture.
* **Language(s)**: Japanese, English
* **Library**: [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
* **Tokenizer**: Please refer to the [Llama 3.1 blog](https://ai.meta.com/blog/meta-llama-3-1) for details on the tokenizer.
* **Contact**: swallow[at]nlp.c.titech.ac.jp

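## Usage

The card declares `library_name: transformers` and `pipeline_tag: text-generation`, so the models should load with the standard Hugging Face Transformers API. The sketch below is illustrative rather than official: the repository ID (the 8B base model is used here), dtype, and generation settings are assumptions you may need to adjust for your hardware.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any model from the Swallow Model Index above can be substituted here.
model_id = "tokyotech-llm/Llama-3.1-Swallow-8B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed dtype; use float16 or "auto" if preferred
    device_map="auto",           # requires the `accelerate` package
)

# Plain text continuation for the base (non-Instruct) models.
prompt = "日本で一番高い山は"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For the Instruct variants, formatting conversations with the tokenizer's chat template (`tokenizer.apply_chat_template`), if one is provided, is the usual approach before calling `generate`.
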
## Model Performance

### Japanese tasks

|Model|JCom.|JEMHopQA|NIILC|JSQuAD|XL-Sum|MGSM|WMT20-en-ja|WMT20-ja-en|JMMLU|JHumanEval|Ja Avg|
|---|---|---|---|---|---|---|---|---|---|---|---|
| |4-shot|4-shot|4-shot|4-shot|1-shot|4-shot|4-shot|4-shot|5-shot|0-shot| |
| |EM acc|Char-F1|Char-F1|Char-F1|ROUGE-2|EM acc|BLEU|BLEU|EM acc|pass@1| |
| Qwen2-72B | 0.9607 | 0.6399 | 0.5617 | 0.9261 | 0.2362 | 0.7560 | 0.2747 | 0.2419 | 0.7831 | 0.5567 | 0.5937 |
| Qwen2.5-72B | **0.9723** | 0.6111 | 0.6194 | **0.9301** | **0.2792** | **0.8280** | 0.2869 | 0.2521 | **0.8046** | **0.6482** | **0.6232** |
| Sarashina2-70B | 0.9285 | **0.7173** | **0.6681** | 0.9294 | 0.1899 | 0.4880 | 0.3129 | 0.2429 | 0.5916 | 0.2384 | 0.5307 |
| Llama 3 70B | 0.9473 | 0.6042 | 0.5965 | 0.9207 | 0.2254 | 0.6720 | 0.2855 | 0.2526 | 0.6975 | 0.4799 | 0.5682 |
| Llama 3.1 70B | 0.9482 | 0.6112 | 0.5968 | 0.9251 | 0.2284 | 0.6840 | 0.2870 | 0.2553 | 0.6690 | 0.4573 | 0.5662 |
| Llama 3 Youko 70B | 0.9455 | 0.6088 | 0.6068 | 0.9226 | 0.2428 | 0.6680 | 0.2909 | 0.2495 | 0.7038 | 0.4530 | 0.5692 |
| Llama 3 Swallow 70B | 0.9714 | 0.6695 | 0.6881 | 0.9218 | 0.2404 | 0.7080 | 0.3072 | 0.2548 | 0.7049 | 0.4683 | 0.5934 |
| Llama 3.1 Swallow 70B | 0.9553 | 0.6450 | 0.6776 | 0.9231 | 0.2722 | 0.6840 | **0.3199** | **0.2591** | 0.7088 | 0.4872 | 0.5932 |

### English tasks

|Model|OpenBookQA|TriviaQA|HellaSWAG|SQuAD2.0|XWINO|MMLU|GSM8K|BBH|HumanEval|En Avg|
|---|---|---|---|---|---|---|---|---|---|---|
| |4-shot|4-shot|4-shot|4-shot|4-shot|5-shot|4-shot|3-shot|0-shot| |
| |Acc|EM acc|Acc|EM acc|Acc|Acc|EM acc|CoT EM Acc|pass@1| |
| Qwen2-72B | 0.4160 | **0.7890** | 0.6766 | **0.4052** | 0.9161 | 0.8428 | **0.8908** | 0.6388 | **0.6049** | **0.6867** |
| Qwen2.5-72B | 0.4160 | 0.7604 | **0.6849** | 0.3997 | 0.9015 | **0.8608** | 0.8726 | **0.7268** | 0.5543 | 0.6863 |
| Sarashina2-70B | 0.3920 | 0.5373 | 0.6270 | 0.4174 | **0.9178** | 0.6303 | 0.0106 | 0.6386 | 0.2799 | 0.4945 |
| Llama 3 70B | 0.4360 | 0.8263 | 0.6909 | 0.4071 | 0.9213 | 0.7870 | 0.8014 | 0.8266 | 0.5177 | 0.6905 |
| Llama 3.1 70B | **0.4420** | 0.8288 | 0.6898 | 0.4050 | 0.9196 | 0.7846 | 0.7991 | 0.6566 | 0.5476 | 0.6748 |
| Llama 3 Youko 70B | 0.4300 | 0.8291 | 0.6900 | 0.4057 | 0.9222 | 0.7862 | 0.7968 | 0.8275 | 0.4128 | 0.6778 |
| Llama 3 Swallow 70B | 0.4240 | 0.8231 | 0.6828 | 0.4059 | 0.9234 | 0.7745 | 0.8143 | 0.7352 | 0.4909 | 0.6749 |
| Llama 3.1 Swallow 70B | 0.4320 | 0.8262 | 0.6898 | 0.4018 | 0.9277 | 0.7724 | 0.8089 | 0.8063 | 0.5396 | 0.6894 |

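The Ja Avg and En Avg columns appear to be the plain unweighted mean of the per-task scores in each row; for example, averaging the ten Japanese scores of the Llama 3.1 Swallow 70B row reproduces its 0.5932. A minimal sketch of that aggregation, under the assumption that no task weighting is applied, is shown below.

```python
from statistics import mean

# Per-task Japanese scores for the Llama 3.1 Swallow 70B row of the table above.
ja_scores = {
    "JCom.": 0.9553, "JEMHopQA": 0.6450, "NIILC": 0.6776, "JSQuAD": 0.9231,
    "XL-Sum": 0.2722, "MGSM": 0.6840, "WMT20-en-ja": 0.3199, "WMT20-ja-en": 0.2591,
    "JMMLU": 0.7088, "JHumanEval": 0.4872,
}

# Unweighted mean over the ten tasks, matching the reported Ja Avg of 0.5932.
ja_avg = mean(ja_scores.values())
print(f"Ja Avg = {ja_avg:.4f}")
```
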
## Evaluation Benchmarks

### Japanese evaluation benchmarks

We used llm-jp-eval (v1.3.0), the JP Language Model Evaluation Harness (commit #9b42d41), and the Code Generation LM Evaluation Harness (commit #0261c52). The details are as follows:

- Multiple-choice question answering (JCommonsenseQA [Kurihara et al., 2022])
- Open-ended question answering (JEMHopQA [Ishii et al., 2024])
- Open-ended question answering (NIILC [Sekine, 2003])
- Machine reading comprehension (JSQuAD [Kurihara et al., 2022])
- Automatic summarization (XL-Sum [Hasan et al., 2021])
- Machine translation (WMT2020 ja-en [Barrault et al., 2020])
- Machine translation (WMT2020 en-ja [Barrault et al., 2020])
- Mathematical reasoning (MGSM [Shi et al., 2023])
- Academic exams (JMMLU [Yin et al., 2024])
- Code generation (JHumanEval [Sato et al., 2024])

### English evaluation benchmarks

We used the Language Model Evaluation Harness (v0.4.2) and the Code Generation LM Evaluation Harness (commit #0261c52). The details are as follows:

- Multiple-choice question answering (OpenBookQA [Mihaylov et al., 2018])
- Open-ended question answering (TriviaQA [Joshi et al., 2017])
- Machine reading comprehension (SQuAD2 [Rajpurkar et al., 2018])
- Commonsense reasoning (XWINO [Tikhonov and Ryabinin, 2021])
- Natural language inference (HellaSwag [Zellers et al., 2019])
- Mathematical reasoning (GSM8K [Cobbe et al., 2021])
- Reasoning (BBH (BIG-Bench-Hard) [Suzgun et al., 2023])
- Academic exams (MMLU [Hendrycks et al., 2021])
- Code generation (HumanEval [Chen et al., 2021])

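As a rough illustration of how the English-side scores can be reproduced, the sketch below runs a few of the tasks above through the `lm_eval` Python API of the Language Model Evaluation Harness v0.4.x. It is only a sketch: the task names, few-shot count, and batch size are assumptions based on common harness configurations, and the Japanese-side tools (llm-jp-eval, the JP harness fork, and the code-generation harness) are driven by their own configurations and are not shown here.

```python
# Assumes `pip install lm_eval==0.4.2` and enough GPU memory for the chosen model.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face Transformers backend
    model_args="pretrained=tokyotech-llm/Llama-3.1-Swallow-8B-v0.1,dtype=bfloat16",
    tasks=["openbookqa", "triviaqa", "hellaswag", "gsm8k"],  # assumed task names
    num_fewshot=4,      # the card uses per-task shot counts; a single value is used here for brevity
    batch_size="auto",
)

# Print the per-task metric dictionaries reported by the harness.
for task, metrics in results["results"].items():
    print(task, metrics)
```
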
## Training Datasets

### Continual Pre-Training

The following datasets were used for continual pre-training.

- [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)
- [Dclm-baseline-1.0](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0)
- [English Wikipedia](https://dumps.wikimedia.org/other/cirrussearch)
- [Japanese Wikipedia](https://dumps.wikimedia.org/other/cirrussearch)
- [Laboro ParaCorpus](https://github.com/laboroai/Laboro-ParaCorpus)
- [Swallow Corpus](https://arxiv.org/abs/2404.17733)
- [The-stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids)

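The Hub-hosted corpora in this list can be sampled without a full download by streaming with the `datasets` library. The snippet below is a generic sketch, not part of the official card: the dataset ID and the field name are assumptions, some entries (e.g. the-stack-v2 variants) are gated and need authentication and extra processing, and the Wikipedia dumps and the Swallow Corpus are not plain Hub datasets.

```python
from datasets import load_dataset

# Stream a few records from one of the Hub-hosted pre-training corpora listed above.
# Other entries may need a config name, authentication, or a different loading path.
ds = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)

for i, example in enumerate(ds):
    # "text" is an assumed field name; inspect example.keys() for the actual schema.
    print(example.get("text", "")[:200])
    if i >= 2:
        break
```
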
## Risks and Limitations

The models released here are still in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.

## Acknowledgements

We thank Meta Research for releasing Llama 3.1 under an open license for others to build on.

Our project is supported by the [Large Generative AI Development Support Program](https://abci.ai/en/link/lfm_support_program.html) of the National Institute of Advanced Industrial Science and Technology.

## License

[META LLAMA 3.1 COMMUNITY LICENSE](https://www.llama.com/llama3_1/license/)

## Authors

Here are the team members:
- From [Tokyo Institute of Technology Okazaki Laboratory](https://www.nlp.c.titech.ac.jp/index.en.html), the following members:
  - [Naoaki Okazaki](https://www.chokkan.org/index.ja.html)
  - [Sakae Mizuki](https://s-mizuki-nlp.github.io/)
  - [Youmi Ma](https://www.nlp.c.titech.ac.jp/member/youmi.en.html)
  - [Koki Maeda](https://sites.google.com/view/silviase)
  - [Kakeru Hattori](https://aya-se.vercel.app/)
  - [Masanari Ohi](https://sites.google.com/view/masanariohi)
  - [Taihei Shiotani](https://github.com/inatoihs)
  - [Koshiro Saito](https://sites.google.com/view/koshiro-saito)
- From [Tokyo Institute of Technology YOKOTA Laboratory](https://www.rio.gsic.titech.ac.jp/en/index.html), the following members:
  - [Rio Yokota](https://twitter.com/rioyokota)
  - [Kazuki Fujii](https://twitter.com/okoge_kaz)
  - [Taishi Nakamura](https://twitter.com/Setuna7777_2)
  - [Takumi Okamoto](https://www.linkedin.com/in/takumi-okamoto)
  - [Ishida Shigeki](https://www.wantedly.com/id/reborn27)
- From [Artificial Intelligence Research Center, AIST, Japan](https://www.airc.aist.go.jp/en/teams/), the following members:
  - [Hiroya Takamura](https://sites.google.com/view/hjtakamura)

## How to cite

If you find our work helpful, please feel free to cite us.

```tex
@inproceedings{Fujii:COLM2024,
   title={Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities},
   author={Kazuki Fujii and Taishi Nakamura and Mengsay Loem and Hiroki Iida and Masanari Ohi and Kakeru Hattori and Hirai Shota and Sakae Mizuki and Rio Yokota and Naoaki Okazaki},
   booktitle={Proceedings of the First Conference on Language Modeling},
   series={COLM},
   pages={(to appear)},
   year={2024},
   month=oct,
   address={University of Pennsylvania, USA},
}

@inproceedings{Okazaki:COLM2024,
   title={Building a Large Japanese Web Corpus for Large Language Models},
   author={Naoaki Okazaki and Kakeru Hattori and Hirai Shota and Hiroki Iida and Masanari Ohi and Kazuki Fujii and Taishi Nakamura and Mengsay Loem and Rio Yokota and Sakae Mizuki},
   booktitle={Proceedings of the First Conference on Language Modeling},
   series={COLM},
   pages={(to appear)},
   year={2024},
   month=oct,
   address={University of Pennsylvania, USA},
}
```

### Citations

```tex
@misc{dubey2024llama3herdmodels,
   title={The Llama 3 Herd of Models},
   author={Abhimanyu Dubey and Abhinav Jauhri and Abhinav Pandey and Abhishek Kadian and Ahmad Al-Dahle and Aiesha Letman and Akhil Mathur and Alan Schelten and Amy Yang and Angela Fan et al.},
   year={2024},
   eprint={2407.21783},
   archivePrefix={arXiv},
   primaryClass={cs.AI},
   url={https://arxiv.org/abs/2407.21783},
}
```