Adding Evaluation Results

#2

Files changed (1): README.md (+14 −1)

README.md CHANGED
*DLite is an experimental technology and is not designed for use in any environment without significant testing and safety consideration. Furthermore, the model can sometimes exhibit undesired behaviors. These behaviors include, but are not limited to: factual inaccuracies, biases, offensive responses, toxicity, and hallucinations. As with any other LLM, we advise users to exercise good judgment when applying this technology.*

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_aisquared__dlite-v2-124m).

| Metric               | Value |
|----------------------|-------|
| Avg.                 | 25.01 |
| ARC (25-shot)        | 23.98 |
| HellaSwag (10-shot)  | 31.1  |
| MMLU (5-shot)        | 25.29 |
| TruthfulQA (0-shot)  | 38.98 |
| Winogrande (5-shot)  | 50.43 |
| GSM8K (5-shot)       | 0.0   |
| DROP (3-shot)        | 5.29  |
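As a sanity check, the `Avg.` row is consistent with the unweighted mean of the seven per-task scores. A minimal sketch (score values copied from the table above; the averaging scheme is assumed to be a plain arithmetic mean):

```python
# Per-task scores from the evaluation table above (percent).
scores = {
    "ARC (25-shot)": 23.98,
    "HellaSwag (10-shot)": 31.1,
    "MMLU (5-shot)": 25.29,
    "TruthfulQA (0-shot)": 38.98,
    "Winogrande (5-shot)": 50.43,
    "GSM8K (5-shot)": 0.0,
    "DROP (3-shot)": 5.29,
}

# Unweighted mean across tasks, rounded to two decimals as reported.
avg = round(sum(scores.values()) / len(scores), 2)
print(avg)  # 25.01
```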