Adding Evaluation Results

#1
Files changed (1)
  1. README.md +114 -2
README.md CHANGED
@@ -2,7 +2,7 @@
  tags:
  - long-cot-reasoning
  - transformers
- - mamba2 # Consider updating if this isn't the architecture
+ - mamba2
  - llms
  - chain-of-thought
  license: apache-2.0
@@ -15,6 +15,105 @@ base_model:
  - Qwen/Qwen2.5-14B-Instruct
  pipeline_tag: text-generation
  library_name: transformers
+ model-index:
+ - name: Sphinx2.0
+   results:
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: IFEval (0-Shot)
+       type: wis-k/instruction-following-eval
+       split: train
+       args:
+         num_few_shot: 0
+     metrics:
+     - type: inst_level_strict_acc and prompt_level_strict_acc
+       value: 71.23
+       name: averaged accuracy
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FSphinx2.0
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: BBH (3-Shot)
+       type: SaylorTwift/bbh
+       split: test
+       args:
+         num_few_shot: 3
+     metrics:
+     - type: acc_norm
+       value: 49.4
+       name: normalized accuracy
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FSphinx2.0
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: MATH Lvl 5 (4-Shot)
+       type: lighteval/MATH-Hard
+       split: test
+       args:
+         num_few_shot: 4
+     metrics:
+     - type: exact_match
+       value: 2.72
+       name: exact match
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FSphinx2.0
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: GPQA (0-shot)
+       type: Idavidrein/gpqa
+       split: train
+       args:
+         num_few_shot: 0
+     metrics:
+     - type: acc_norm
+       value: 5.82
+       name: acc_norm
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FSphinx2.0
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: MuSR (0-shot)
+       type: TAUR-Lab/MuSR
+       args:
+         num_few_shot: 0
+     metrics:
+     - type: acc_norm
+       value: 13.05
+       name: acc_norm
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FSphinx2.0
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: MMLU-PRO (5-shot)
+       type: TIGER-Lab/MMLU-Pro
+       config: main
+       split: test
+       args:
+         num_few_shot: 5
+     metrics:
+     - type: acc
+       value: 46.49
+       name: accuracy
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Daemontatox%2FSphinx2.0
+       name: Open LLM Leaderboard
  ---

  ![Sphinx of Reasoning](./image.webp)
@@ -69,4 +168,17 @@ Sphinx is a cutting-edge Long Chain-of-Thought (CoT) reasoning model meticulousl
  - **Advanced Academic Research:** Generating in-depth, logically structured analyses for complex scientific and philosophical inquiries.
  - **Robust Legal Reasoning Assistance:** Constructing and articulating multi-step legal arguments with precision and logical rigor.
  - **Transformative STEM Education:** Guiding learners through intricate mathematical and logical problems with clear, step-by-step explanations.
- - **Transparent Cognitive AI Systems:** Powering AI systems where explainability and logical justification are paramount for decision-making.
+ - **Transparent Cognitive AI Systems:** Powering AI systems where explainability and logical justification are paramount for decision-making.
+
+ # [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
+ Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/Daemontatox__Sphinx2.0-details)!
+ Summarized results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/contents/viewer/default/train?q=Daemontatox%2FSphinx2.0&sort[column]=Average%20%E2%AC%86%EF%B8%8F&sort[direction]=desc)!
+
+ | Metric              | Value (%) |
+ |---------------------|----------:|
+ | **Average**         |     31.45 |
+ | IFEval (0-Shot)     |     71.23 |
+ | BBH (3-Shot)        |     49.40 |
+ | MATH Lvl 5 (4-Shot) |      2.72 |
+ | GPQA (0-shot)       |      5.82 |
+ | MuSR (0-shot)       |     13.05 |
+ | MMLU-PRO (5-shot)   |     46.49 |
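As a quick sanity check, the reported **Average** is consistent with the mean of the six benchmark scores in the summary table. A minimal sketch that just redoes the arithmetic (values copied from the table above):

```python
# Benchmark scores from the summary table above, in percent.
scores = {
    "IFEval (0-Shot)": 71.23,
    "BBH (3-Shot)": 49.40,
    "MATH Lvl 5 (4-Shot)": 2.72,
    "GPQA (0-shot)": 5.82,
    "MuSR (0-shot)": 13.05,
    "MMLU-PRO (5-shot)": 46.49,
}

# Unweighted mean across the six benchmarks.
average = sum(scores.values()) / len(scores)
print(f"Average: {average:.2f}")  # -> Average: 31.45, matching the table
```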
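Once this PR is merged, the `model-index` block above becomes machine-readable card metadata. A minimal sketch of reading the scores back out, assuming the merged card is live at `Daemontatox/Sphinx2.0` (the repo id used in the leaderboard URLs) and that `huggingface_hub` and `pyyaml` are installed:

```python
# Fetch the model card and parse the model-index block from its YAML front matter.
import yaml
from huggingface_hub import hf_hub_download

readme_path = hf_hub_download(repo_id="Daemontatox/Sphinx2.0", filename="README.md")
with open(readme_path, encoding="utf-8") as f:
    text = f.read()

# The card metadata sits between the first pair of "---" fences at the top of the file.
front_matter = text.split("---")[1]
meta = yaml.safe_load(front_matter)

# One entry per benchmark in the model-index added by this PR.
for result in meta["model-index"][0]["results"]:
    dataset = result["dataset"]["name"]
    metric = result["metrics"][0]
    print(f"{dataset}: {metric['value']} ({metric['type']})")
```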
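For context, the unchanged metadata declares `pipeline_tag: text-generation` and `library_name: transformers`, so a standard causal-LM load should apply. A minimal sketch, not an official usage snippet from the card (the prompt is illustrative, and a ~14B-parameter model needs correspondingly large GPU memory or offloading):

```python
# Standard transformers causal-LM loading, per the pipeline_tag/library_name metadata.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Daemontatox/Sphinx2.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumes bf16-capable hardware
    device_map="auto",           # spread layers across available devices
)

prompt = "Reason step by step: which is larger, 2^10 or 10^3?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```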