@@ -2,7 +2,52 @@
 license: apache-2.0
 ---
 # K2-Chat: a fully-reproducible large language model outperforming Llama 2 70B Chat using 35% less compute
+K2 Chat is finetuned from [K2-65B](https://huggingface.co/LLM360/K2). The most recent model update was released on 10/31/24.
+In this release, we introduce function calling features and target improvements across math, coding, and safety.
+We used the following datasets:
+- [Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)
+- [JiuZhang3.0-Corpus-SFT](https://huggingface.co/datasets/ToheartZhang/JiuZhang3.0-Corpus-SFT)
+- [glaive-function-calling-v2-sharegpt](https://huggingface.co/datasets/hiyouga/glaive-function-calling-v2-sharegpt)
+## Results
+|                         | K2-Chat-060124 | K2-Chat |
+|-------------------------|---------|----------|
+| **Natural Language Benchmarks** |         |          |
+| MMLU (0-shot)           | 63.5    | 69.14    |
+| RACE (0-shot)           | 46.1    | 46.60    |
+| HellaSwag (10-shot)     | 81.7    | 80.80    |
+| PIQA (5-shot)           | 82.3    | 81.34    |
+| ARC-easy (5-shot)       | 84.6    | 79.00    |
+| ARC-challenge (25-shot) | 61.3    | 61.09    |
+| OpenBookQA (5-shot)     | 48.0    | 47.00    |
+| Winogrande (5-shot)     | 79.5    | 78.30    |
+| TruthfulQA (0-shot)     | 44.7    | 57.32    |
+| CrowS-Pairs (0-shot)    | 64.2    | 65.32    |
+| GSM8K (5-shot)          | 60.7    | 77.10    |
+| MathQA (5-shot)         | 44.8    | 43.12    |
+| LogiQA2.0 (0-shot)      | 38.0    | 36.83    |
+| BBH CoT (0-shot)        | 64.9    | 70.37    |
+| **Code Benchmarks**     |         |          |
+| HumanEval (pass@1)      | 47.9    | 71.20    |
+| **Domain Specific (Medical)** |   |          |
+| MedQA (0-shot)          | 53.6    | 52.87    |
+| MedMCQA (5-shot)        | 51.3    | 50.71    |
+| PubMedQA (0-shot)       | 75.0    | 71.20    |
+| **Other**               |         |          |
+| MT-Bench               | 6.87     | 7.55     |
+| JSON-Mode-Eval          | 77.21   | 90.09    |
+| **Overall Average Score**|         |          |
+| Avg Score               | 58.88   | 61.30    |
+## K2-Chat-060124
 K2 Chat is finetuned from [K2-65B](https://huggingface.co/LLM360/K2). K2 Chat outperforms Llama 2-70B-Chat on all evaluations conducted. The model also outperforms Llama 3-70B-Instruct on coding tasks.
 <center><img src="k2_chat_eval_table.png" alt="k2 eval table" /></center>