Taishi-N324 committed
Commit cff52c3
1 Parent(s): 9c9d0cc
Upload README.md
README.md CHANGED
@@ -34,8 +34,7 @@ We released the 7B and 70B models without vocabulary expansion on January 26th,
 ![logo](./logo.png)
 
 This repository provides large language models developed by [TokyoTech-LLM](https://tokyotech-llm.github.io/).
-Read our [blog post](https://zenn.dev/tokyotech_lm/articles/d6cb3a8fdfc907) or our paper
-
+Read our [blog post](https://zenn.dev/tokyotech_lm/articles/d6cb3a8fdfc907) or our [paper](https://www.anlp.jp/proceedings/annual_meeting/2024/pdf_dir/A8-5.pdf)
 
 ## Model Details
 
@@ -47,7 +46,7 @@
 
 ## Base Model Performance
 
-### Japanese
+### Japanese tasks
 
 |Model|Size|JCommonsenseQA|JEMHopQA|NIILC|JSQuAD|XL-Sum|MGSM|WMT20-en-ja|WMT20-ja-en|
 |---|---|---|---|---|---|---|---|---|---|
@@ -62,7 +61,7 @@
 | Llama 2 | 70B | 0.8686 | 0.4656 | 0.5256 | 0.9080 | 0.2361 | 0.3560 | 0.2643 | **0.2398** |
 | Swallow | 70B | 0.9348 | **0.6290** | 0.6960 | 0.9176 | 0.2266 | **0.4840** | **0.3043** | 0.2298 |
 | Swallow-NVE | 70B | **0.9410** | 0.5759 | **0.7024** | **0.9254** | **0.2758** | 0.4720 | 0.3042 | 0.2322 |
-### English
+### English tasks
 
 |Model|Size|OpenBookQA|TriviaQA|HellaSwag|SQuAD2.0|XWINO|GSM8K|
 |---|---|---|---|---|---|---|---|
@@ -78,6 +77,33 @@
 | Swallow | 70B | 0.4220 | 0.7756 | 0.6458 | 0.3745 | 0.9204 | 0.4867 |
 | Swallow-NVE | 70B | 0.4240 | 0.7817 | 0.6439 | 0.3451 | 0.9256 | 0.4943 |
 
+## Evaluation Benchmarks
+
+### Japanese evaluation benchmarks
+
+We used llm-jp-eval (v1.0.0) and the JP Language Model Evaluation Harness (commit #9b42d41). The details are as follows:
+
+- Multiple-choice question answering (JCommonsenseQA [Kurihara+, 2022])
+- Open-ended question answering (JEMHopQA [Ishii+, 2023])
+- Open-ended question answering (NIILC [Sekine, 2003])
+- Machine reading comprehension (JSQuAD [Kurihara+, 2022])
+- Automatic summarization (XL-Sum [Hasan+, 2021])
+- Machine translation (WMT2020 ja-en [Barrault+, 2020])
+- Machine translation (WMT2020 en-ja [Barrault+, 2020])
+- Mathematical reasoning (MGSM [Shi+, 2023])
+
+### English evaluation benchmarks
+
+We used the Language Model Evaluation Harness (v0.3.0). The details are as follows:
+
+- Multiple-choice question answering (OpenBookQA [Mihaylov+, 2018])
+- Open-ended question answering (TriviaQA [Joshi+, 2017])
+- Machine reading comprehension (SQuAD 2.0 [Rajpurkar+, 2018])
+- Commonsense reasoning (XWINO [Tikhonov & Ryabinin, 2021])
+- Natural language inference (HellaSwag [Zellers+, 2019])
+- Mathematical reasoning (GSM8K [Cobbe+, 2021])
+
+
 ## Usage
 
 First install additional dependencies in [requirements.txt](./requirements.txt):
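The commit view truncates the Usage section at this point. The command that a requirements.txt reference conventionally introduces, assumed here because the fenced block itself is cut off:

```sh
# Assumed completion of the truncated install step:
pip install -r requirements.txt
```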
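A minimal sketch of loading and prompting the model with Hugging Face transformers follows. The checkpoint name tokyotech-llm/Swallow-70b-hf is an assumption for illustration (the visible hunks never name the exact repository), and the sampling settings are illustrative rather than the authors' recommended values:

```python
# Minimal sketch, not this README's own example.
# Assumption: the model ID below is hypothetical; substitute the
# checkpoint this model card actually describes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tokyotech-llm/Swallow-70b-hf"  # assumed model ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # reduce memory for a 70B checkpoint
    device_map="auto",           # requires accelerate; shards across GPUs
)

prompt = "東京工業大学の主なキャンパスは、"  # "Tokyo Tech's main campuses are ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,  # illustrative sampling settings
        top_p=0.95,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```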
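For the English benchmarks listed in the diff above, the Language Model Evaluation Harness v0.3.0 exposes a Python entry point, evaluator.simple_evaluate. A hedged sketch of one run; the task names and few-shot count are illustrative, not the authors' exact configuration, and the checkpoint is the same assumed ID as above:

```python
# Sketch under assumptions: tasks follow the harness's v0.3.x registry,
# and the pretrained ID is a hypothetical placeholder.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=tokyotech-llm/Swallow-70b-hf",
    tasks=["openbookqa", "triviaqa", "hellaswag"],
    num_fewshot=0,        # illustrative; papers often vary this per task
    device="cuda:0",
)
print(results["results"])  # per-task metrics as a dict
```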