---
language:
- ko
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: Synatra-V0.1-7B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 55.29
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=maywell/Synatra-V0.1-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 76.63
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=maywell/Synatra-V0.1-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 55.29
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=maywell/Synatra-V0.1-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 55.76
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=maywell/Synatra-V0.1-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 72.77
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=maywell/Synatra-V0.1-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 19.41
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=maywell/Synatra-V0.1-7B
      name: Open LLM Leaderboard
---
# **V0.3 IS UP**
[Link to V0.3](https://huggingface.co/maywell/Synatra-7B-v0.3-base)
# **Synatra-V0.1-7B**
Made by StableFluffy
[Visit my website! - Currently under construction.](https://www.stablefluffy.kr/)
## License
This model is strictly for [*non-commercial*](https://creativecommons.org/licenses/by-nc/4.0/) (**cc-by-nc-4.0**) use only, which takes priority over the **LLAMA 2 COMMUNITY LICENSE AGREEMENT**.
The "Model" (i.e. base model, derivatives, merges/mixes) is completely free to use for non-commercial purposes as long as the included **cc-by-nc-4.0** license in any parent repository and the non-commercial use statute remain, regardless of other models' licences.
The licence may be changed after a new model is released.
## Model Details
**Base Model**
[mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)
**Trained On**
A6000 48GB * 8
## Instruction format
**Due to a mistake during training, `[\INST]` was applied instead of `[/INST]`. This will be fixed in v0.2.**
To leverage instruction fine-tuning, your prompt should be surrounded by `[INST]` and `[\INST]` tokens. The very first instruction should begin with a beginning-of-sentence id. Subsequent instructions should not. The assistant generation is ended by the end-of-sentence token id.
It is also strongly recommended to add a space at the end of the prompt.
E.g.
```
# Korean prompt: "Tell me about Isaac Newton's achievements."
text = "<s>[INST] 아이작 뉴턴의 업적을 알려줘. [\INST] "
```
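The format described above can also be assembled programmatically. The following is a minimal sketch, not the author's code: `build_prompt` and its `(user_text, assistant_text)` turn layout are hypothetical names introduced here, and the multi-turn spacing is an assumption modeled on the single-turn example. Only the `[INST]` / `[\INST]` wrapping, the single leading `<s>`, and the trailing space come from this card.

```python
BOS, EOS = "<s>", "</s>"
INST_OPEN = "[INST]"
INST_CLOSE = "[\\INST]"  # escaped backslash: the literal text is [\INST], as trained into V0.1


def build_prompt(turns):
    """Build a prompt from (user_text, assistant_text_or_None) pairs.

    Hypothetical helper: pass None as the assistant text of the last turn
    to leave the prompt open for generation.
    """
    parts = [BOS]  # only the very first instruction is preceded by BOS
    for user_text, assistant_text in turns:
        # Trailing space after the closing tag, as recommended above.
        parts.append(f"{INST_OPEN} {user_text} {INST_CLOSE} ")
        if assistant_text is not None:
            # Assumption: a finished assistant turn is closed with EOS.
            parts.append(f"{assistant_text}{EOS}")
    return "".join(parts)


print(build_prompt([("Tell me about Isaac Newton's achievements.", None)]))
```

If the resulting string is passed to a tokenizer that already prepends a beginning-of-sentence token, the leading `<s>` may need to be dropped to avoid duplicating it.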
# **Model Benchmark**
## KULLM Evaluation
Evaluated using the dataset and prompts provided in the KULLM v2 repo.
Because the GPT-4 of that time and the current GPT-4 are not exactly identical, there may be slight differences from the actual results.
![img](./kullm_eval.png)
| Model | Understandability | Naturalness | Context Maintenance | Interestingness | Instruction Usage | Overall Quality
| --- | --- | --- | --- | --- | --- | ---
| GPT-3.5 | 0.980 | 2.806 | 2.849 | 2.056 | 0.917 | 3.905
| GPT-4 | 0.984 | 2.897 | 2.944 | 2.143 | 0.968 | 4.083
| KoAlpaca v1.1 | 0.651 | 1.909 | 1.901 | 1.583 | 0.385 | 2.575
| koVicuna | 0.460 | 1.583 | 1.726 | 1.528 | 0.409 | 2.440
| KULLM v2 | 0.742 | 2.083 | 2.107 | 1.794 | 0.548 | 3.036
| **Synatra-V0.1-7B** | **0.960** | **2.821** | **2.755** | **2.356** | **0.934** | **4.065**
## KOBEST_BOOLQ, SENTINEG, WIC - ZERO_SHOT
BoolQ, SentiNeg, and WiC were measured using [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot).
HellaSwag and COPA have not been run yet because of difficulties encountered while modifying the original code.
### NOTE
For BoolQ, the prompts "위 글에 대한 질문에 사실을 확인하는 작업입니다." ("This is a task of checking the facts for a question about the passage above.") and "예, 아니오로 대답해주세요." ("Please answer yes or no.") were added to help the instruction model understand the task.
For SentiNeg, the prompt "위 문장의 긍정, 부정 여부를 판단하세요." ("Determine whether the sentence above is positive or negative.") was added to help the instruction model understand the task.
For WiC, only `[INST]` and `[\INST]` were added.
| Model | COPA | HellaSwag | BoolQ | SentiNeg | Wic
| --- | --- | --- | --- | --- | ---
| EleutherAI/polyglot-ko-12.8b | 0.7937 | 0.5954 | 0.4818 | 0.9117 | 0.3985
| **Synatra-V0.1-7B** | **NaN** | **NaN** | **0.849** | **0.8690** | **0.4881**
# **Implementation Code**
Since `chat_template` already contains the instruction format above, you can use the code below.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto
model = AutoModelForCausalLM.from_pretrained("maywell/Synatra-V0.1-7B")
tokenizer = AutoTokenizer.from_pretrained("maywell/Synatra-V0.1-7B")
messages = [
{"role": "user", "content": "What is your favourite condiment?"},
]
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
model_inputs = encodeds.to(device)
model.to(device)
generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
```
If you run it on oobabooga, your prompt would look like this (the Korean reads "Tell me about Lincoln."). **A space must be added at the end!**
```
[INST] 링컨에 대해서 알려줘. [\INST] 
```
> Readme format: [beomi/llama-2-ko-7b](https://huggingface.co/beomi/llama-2-ko-7b)
---
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_maywell__Synatra-V0.1-7B)
| Metric | Value |
|-----------------------|---------------------------|
| Avg. | 53.54 |
| ARC (25-shot) | 55.29 |
| HellaSwag (10-shot) | 76.63 |
| MMLU (5-shot) | 55.29 |
| TruthfulQA (0-shot) | 55.76 |
| Winogrande (5-shot) | 72.77 |
| GSM8K (5-shot) | 19.41 |
| DROP (3-shot) | 39.63 |
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_maywell__Synatra-V0.1-7B)
| Metric |Value|
|---------------------------------|----:|
|Avg. |55.86|
|AI2 Reasoning Challenge (25-Shot)|55.29|
|HellaSwag (10-Shot) |76.63|
|MMLU (5-Shot) |55.29|
|TruthfulQA (0-shot) |55.76|
|Winogrande (5-shot) |72.77|
|GSM8k (5-shot) |19.41|
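
The two tables above report different averages (53.54 and 55.86). The short check below suggests why: it assumes that `Avg.` is the plain arithmetic mean of the listed scores, which the card does not state explicitly. The variable and function names are illustrative only.

```python
# Scores copied from the leaderboard tables above.
scores = {
    "ARC (25-shot)": 55.29,
    "HellaSwag (10-shot)": 76.63,
    "MMLU (5-shot)": 55.29,
    "TruthfulQA (0-shot)": 55.76,
    "Winogrande (5-shot)": 72.77,
    "GSM8K (5-shot)": 19.41,
}
drop_score = 39.63  # DROP (3-shot), listed only in the first table


def average(values):
    """Arithmetic mean rounded to two decimals (assumed averaging rule)."""
    return round(sum(values) / len(values), 2)


avg_without_drop = average(list(scores.values()))
avg_with_drop = average(list(scores.values()) + [drop_score])

print(f"six benchmarks:         {avg_without_drop}")
print(f"six benchmarks + DROP:  {avg_with_drop}")
```

Under that assumption, the first table's average includes DROP and the second table's does not, which accounts for the gap between them.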