from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard 
    task0 = Task("custom|heq-qa-tlnls|0", "heq_tlnls", "QA TLNLS (HeQ)")
    task1 = Task("custom|sentiment-acc|0", "sentiment_acc", "Sentiment Acc (Mafat)")
    task2 = Task("custom|winograd-acc|0", "winograd_acc", "Winograd (Binary) Acc (V. Schwartz)")
    task3 = Task("custom|he-en-trans-bleu|0", "sentence_bleu", "Translation BLEU")
    
NUM_FEWSHOT = 0 # Change with your few shot
# ---------------------------------------------------



# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">Hebrew LLM Leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
<div style="display: flex; justify-content: center;">
<div style="max-width: 70vw;">

Welcome to the Leaderboard for open Hebrew LLMs. The leaderboard ranks the different models according to their success on various tasks on Hebrew. 

The leaderboard was created and is operated by a collaboration of [Mafat / The Israeli National Program for NLP in Hebrew and Arabic](https://nnlp-il.mafat.ai/) and [DICTA: The Israel Center for Text Analysis](https://dicta.org.il/). 

<div dir="rtl" style="text-align: right">

ברוכים הבאים ללוח התוצאות של מודלי LLM פתוחים בעברית. לוח התוצאות מדרג את המודלים השונים לפי הצלחתם במשימות שונות בעברית. 

לוח התוצאות נוצר ומתופעל על ידי שיתוף פעולה בין [מפא"ת / התוכנית הלאומית הישראלית ל-NLP בעברית ובערבית](https://nnlp-il.mafat.ai/) ו[דיקטה: המרכז הישראלי לניתוח טקסטים](https://dicta.org.il/)

</div>

<div style="display: flex; flex-direction: row; justify-content: space-around; align-items: center" dir="ltr">
  <a href="https://dicta.org.il/">
    <img src="file/logos/dicta-logo.jpg" alt="Dicta Logo" style="max-height: 65px">
  </a>
  <a href="https://nnlp-il.mafat.ai/">
    <img src="file/logos/mafat-logo.jpg" alt="Mafat Logo" style="max-height: 100px">
  </a>
</div>
</div>
</div>
"""

# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f"""
## How it works

We have curated 4 datasets for benchmarking the quality of the LLMs in Hebrew. All of the benchmarks test the base model using a few-shot prompt. Note that the tests specifically evaluate the model's abilities regarding Hebrew, without regard for the capabilities of the model in other languages.

1. QA TLNLS (HeQ)

    - **Source**: We use the test subset of the HeQ dataset, released by Amir Cohen [here](https://aclanthology.org/2023.findings-emnlp.915/). Data can be found [here](https://github.com/NNLP-IL/Hebrew-Question-Answering-Dataset). 
    
    - **Scoring**: We score the results using the `tlnls` scoring method proposed in the paper released with HeQ, which accounts for the linguistic properties of the Hebrew language. 
    
    - **Number of examples**: 1,436 prompts. 
    
    - **Few-Shot Format**: For every context paragraph in the dataset, the few-shot prompt consists of the context paragraph, followed by 3 questions and answers on that paragraph, and finally the target question left unanswered.
    
    For example:

<blockquote dir="rtl" style='text-align: right; background-color: #f0f0f0;'>
<p>בשנת 2012, התמודדה לראשונה בפריימריז של מפלגת העבודה לקראת הבחירות לכנסת התשע עשרה והגיעה למקום ה־36 ברשימה הארצית (הבטחת ייצוג לאישה). ב־2015 לקראת הבחירות לכנסת העשרים, התמודדה ורבין בפריימריז של מפלגת העבודה והוצבה במקום ה-22 ברשימת המחנה הציוני לכנסת, אשר שוריין לאישה ונבחרה לכנסת. בשנת הכהונה הראשונה שלה בכנסת, העניק לה המכון הישראלי לדמוקרטיה את אות הפרלמנטר המצטיין לשנת 2016. חברה בוועדת החוץ וביטחון, שם היא חברה בוועדת המשנה לכוח אדם. יזמה וישבה בראש וועדת המשנה לבחינת משק האשראי בישראל. יזמה וחברה בוועדת החקירה הפרלמנטרית לבחינת משק האשראי בישראל, וכן חברה בוועדת הכלכלה, וועדת הכנסת והוועדה המיוחדת לזכויות הילד, ובוועדת המשנה לקידום עסקים קטנים ובינוניים</p>

שאלה: באיזה פרס זכתה ורבין? <br/>
תשובה: אות הפרלמנטר המצטיין לשנת 2016

שאלה: מי מעניק את אות הפרלמנטר המצטיין?<br/>
תשובה: המכון הישראלי לדמוקרטיה

שאלה: מתי התקיימו הבחירות לכנסת העשרים? <br/>
תשובה: ב־2015

שאלה: לאיזו כנסת נכנסה ורבין לראשונה? <br/>
תשובה: 
</blockquote>
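
As a rough illustration of the format above, a prompt of this shape could be assembled as follows. This is only a sketch: `build_qa_prompt_lines` and its arguments are hypothetical names, not the leaderboard's actual code, and the Hebrew labels are copied from the example.

```python
# Hypothetical helper, not the leaderboard's actual prompt code.
# Labels as they appear in the example ("Question" and "Answer").
QUESTION_LABEL = "שאלה"
ANSWER_LABEL = "תשובה"


def build_qa_prompt_lines(context, solved_pairs, target_question):
    # Start with the context paragraph, followed by a blank line.
    lines = [context, ""]
    # Add each solved question together with its answer.
    for question, answer in solved_pairs:
        lines.append(QUESTION_LABEL + ": " + question)
        lines.append(ANSWER_LABEL + ": " + answer)
        lines.append("")
    # End with the target question and an empty answer slot for the model.
    lines.append(QUESTION_LABEL + ": " + target_question)
    lines.append(ANSWER_LABEL + ": ")
    return lines


lines = build_qa_prompt_lines(
    "context paragraph",
    [("q1", "a1"), ("q2", "a2"), ("q3", "a3")],
    "q4",
)
```

Joining the returned lines with newlines gives the text sent to the model, which is expected to continue after the final answer label.
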

2. Sentiment Acc (Mafat)

    
    - **Source**: We use a test subset of an early version of the Hebrew Sentiment dataset, released by Mafat & NNLP-IL [here](https://www.facebook.com/groups/MDLI1/permalink/2681774131986618/). The latest version of the data can be found [here](https://github.com/NNLP-IL/Hebrew-Question-Answering-Dataset), although it differs from the data we used. 
    
    - **Scoring**: We compute the accuracy score on the predictions, expecting either "חיובי", "שלילי", or "נטרלי".
    
    - **Number of examples**: 3,000 examples, 1,000 from each category. These examples were selected by a linguist tagger. 
    
    - **Few-Shot Format**: For every prompt, we provide 9 few-shot examples, 3 from each category, randomly shuffled.
    
    For example:

<blockquote dir="rtl" style='text-align: right; background-color: #f0f0f0'>
<p>
משפט: משפט חיובי <br/>
תשובה: חיובי

משפט: משפט שלילי  <br/>
תשובה: שלילי

משפט: משפט נטרלי  <br/>
תשובה: נטרלי

...

משפט: משפט כלשהו  <br/>
תשובה: 
</blockquote>
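
The few-shot selection and the accuracy score described above might look roughly like this. It is a sketch under assumptions: `pick_few_shot` and `accuracy` are hypothetical names, the examples are assumed to come from a pool of labelled pairs kept apart from the test set, and a prediction is assumed to count as correct only on an exact match after trimming whitespace.

```python
import random

# The three expected labels: positive, negative, neutral.
LABELS = ["חיובי", "שלילי", "נטרלי"]


def pick_few_shot(pool, per_label=3, seed=0):
    # pool: list of (sentence, label) pairs held out from the test set.
    rng = random.Random(seed)
    shots = []
    for label in LABELS:
        candidates = [pair for pair in pool if pair[1] == label]
        shots.extend(rng.sample(candidates, per_label))
    # Shuffle so the categories do not appear in a fixed order.
    rng.shuffle(shots)
    return shots


def accuracy(predictions, gold):
    # Fraction of predictions equal to the gold label (assumed exact match).
    hits = sum(p.strip() == g for p, g in zip(predictions, gold))
    return hits / len(gold)
```

With the defaults, `pick_few_shot` returns 9 examples, 3 per category, in random order.
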


3. Winograd (Binary) Acc
    
    
    - **Source**: We use `A Translation of the Winograd Schema Challenge to Hebrew`, translated by Dr. Vered Shwartz. The data can be found [here](https://www.cs.ubc.ca/~vshwartz/resources/winograd_he.jsonl).
    
    - **Scoring**: We provide in the prompt the two possible answers, and compute the accuracy score. 
    
    - **Number of examples**: 278 examples.
    
    - **Few-Shot Format**: For every prompt, we provide 5 few-shot examples, and then the question at hand. Each example is formatted with the input sentence with the question, the possible answers, and the expected answer. 
    
    For example:

<blockquote dir="rtl" style='text-align: right; background-color: #f0f0f0'>
<p>
שאלה: השוטרים עצרו את חברי הכנופיה. הם ניהלו ארגון של סחר בסמים. מי ניהלו?  <br/>
אפשרויות: "חברי הכנופיה" או "השוטרים"<br/>
תשובה: חברי הכנופיה

...

שאלה: השועלים היו מגיעים בלילות לתקוף את התרנגולים, אז הייתי צריך לשמור עליהם. על מי הייתי צריך לשמור?<br/>
אפשרויות: "התרנגולים" או "השועלים"<br/>
תשובה: 
</blockquote>


4. Translation BLEU

    - **Source**: We use the aligned translation corpus `NeuLab-TedTalks`, which can be found [here](https://opus.nlpl.eu/NeuLab-TedTalks/en&he/v1/NeuLab-TedTalks). 
    
    - **Scoring**: We use the `sacrebleu.sentence_bleu` scoring function. 
    
    - **Number of examples**: We randomly sampled 1,000 examples of 30-40 words in length from the aligned corpus, and compute the mean score for translating them from English to Hebrew and from Hebrew to English (a total of 2,000 examples).
    
    - **Few-Shot Format**: For every prompt, we provide 3 few-shot examples of an English sentence and the Hebrew equivalent. The order depends on the direction that we are attempting to translate to. 
    
    For example:
    
    
<blockquote style="background-color: #f0f0f0;">
<p>
English: Some sentence in English<br/>
Hebrew: משפט בעברית. 

...

English: Some sentence to translate to Hebrew <br/>
Hebrew: 
</blockquote>
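
The sampling and averaging described above could be sketched as follows. The function names are hypothetical, and measuring length as whitespace-separated words on the English side is an assumption, since the text does not say which side is counted.

```python
import random


def sample_pairs(pairs, n=1000, min_words=30, max_words=40, seed=0):
    # pairs: list of (english, hebrew) aligned sentences.
    # Assumption: length is counted in whitespace-separated English words.
    eligible = [
        pair for pair in pairs
        if min_words <= len(pair[0].split()) <= max_words
    ]
    return random.Random(seed).sample(eligible, n)


def mean_score(scores_en_to_he, scores_he_to_en):
    # Average the per-sentence scores over both translation directions.
    all_scores = list(scores_en_to_he) + list(scores_he_to_en)
    return sum(all_scores) / len(all_scores)
```

The per-sentence scores themselves could come from `sacrebleu.sentence_bleu(hypothesis, [reference]).score`, computed once per direction for each sampled pair.
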

"""

EVALUATION_QUEUE_TEXT = """
## Important Note

Due to budget restrictions, we have a cap on the number of models that can be tested per month. Please only send your model when you are ready for testing. We also limit the number of models that each user can submit. 

## Some good practices before submitting a model

### 1) Make sure you can load your model and tokenizer using AutoClasses:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

revision = "main"  # branch, tag, or commit hash of the version you are submitting
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```
If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.

Note: make sure your model is public!
Note: if your model needs `trust_remote_code=True`, we do not support this option yet, but we are working on adding it. Stay posted!

### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!

### 3) Make sure your model has an open license!
This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗

### 4) Fill in your model card
When we add extra information about models to the leaderboard, it will be taken automatically from the model card.

## In case of model failure
If your model is displayed in the `FAILED` category, its execution stopped.
Make sure you have followed the above steps first.
If everything is done and the model still won't run, please reach out to `shaltiel at dicta dot org dot il` with the details. 
"""