djinn

djinn is a merge of the following models using LazyMergekit:

openchat/openchat-3.5-0106
teknium/OpenHermes-2.5-Mistral-7B
bardsai/jaskier-7b-dpo-v6.1
senseable/WestLake-7B-v2
NousResearch/Nous-Hermes-2-Mistral-7B-DPO
paulml/OGNO-7B
paulml/DPOB-INMTOB-7B
mlabonne/AlphaMonarch-7B

🏆 Benchmarks

Results on the Nous benchmark suite; find more details here.

Model	AGIEval	GPT4All	TruthfulQA	Bigbench	Average
chatty-djinn-14B	38.43	76.29	68.02	47.6	57.59

AGIEval

Task	Version	Metric	Value		Stderr
agieval_aqua_rat	0	acc	23.62	±	2.67
		acc_norm	21.65	±	2.59
agieval_logiqa_en	0	acc	32.26	±	1.83
		acc_norm	33.79	±	1.86
agieval_lsat_ar	0	acc	23.04	±	2.78
		acc_norm	23.04	±	2.78
agieval_lsat_lr	0	acc	38.82	±	2.16
		acc_norm	39.22	±	2.16
agieval_lsat_rc	0	acc	59.48	±	3.00
		acc_norm	54.65	±	3.04
agieval_sat_en	0	acc	75.73	±	2.99
		acc_norm	74.27	±	3.05
agieval_sat_en_without_passage	0	acc	35.92	±	3.35
		acc_norm	34.47	±	3.32
agieval_sat_math	0	acc	31.36	±	3.14
		acc_norm	26.36	±	2.98

Average: 38.43%

GPT4All

Task	Version	Metric	Value		Stderr
arc_challenge	0	acc	62.12	±	1.42
		acc_norm	65.44	±	1.39
arc_easy	0	acc	83.88	±	0.75
		acc_norm	78.58	±	0.84
boolq	1	acc	88.07	±	0.57
hellaswag	0	acc	65.18	±	0.48
		acc_norm	86.45	±	0.34
openbookqa	0	acc	39.60	±	2.19
		acc_norm	48.60	±	2.24
piqa	0	acc	82.26	±	0.89
		acc_norm	83.62	±	0.86
winogrande	0	acc	83.27	±	1.05

Average: 76.29%
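
The 76.29% figure appears to be the mean of one score per task: acc_norm where the table reports it, and acc otherwise (boolq and winogrande list only acc). This selection rule is inferred from the numbers, not stated in the card, and the names `gpt4all_scores` and `suite_average` are illustrative. A minimal sketch that recomputes the average from the table values:

```python
# (acc, acc_norm) per task, copied from the GPT4All table above.
# None marks tasks for which the table reports no acc_norm.
gpt4all_scores = {
    "arc_challenge": (62.12, 65.44),
    "arc_easy": (83.88, 78.58),
    "boolq": (88.07, None),
    "hellaswag": (65.18, 86.45),
    "openbookqa": (39.60, 48.60),
    "piqa": (82.26, 83.62),
    "winogrande": (83.27, None),
}


def suite_average(scores):
    """Mean of acc_norm per task, falling back to acc when acc_norm is missing.

    Assumption: this fallback rule is inferred from the reported average.
    """
    picked = [acc if acc_norm is None else acc_norm for acc, acc_norm in scores.values()]
    return round(sum(picked) / len(picked), 2)


print(suite_average(gpt4all_scores))
```

The same rule applied to the acc_norm column of the AGIEval table gives that table's 38.43% average.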

TruthfulQA

Task	Version	Metric	Value		Stderr
truthfulqa_mc	1	mc1	50.55	±	1.75
		mc2	68.02	±	1.52

Average: 68.02%

Bigbench

Task	Version	Metric	Value		Stderr
bigbench_causal_judgement	0	multiple_choice_grade	57.89	±	3.59
bigbench_date_understanding	0	multiple_choice_grade	64.50	±	2.49
bigbench_disambiguation_qa	0	multiple_choice_grade	32.56	±	2.92
bigbench_geometric_shapes	0	multiple_choice_grade	26.18	±	2.32
		exact_str_match	1.11	±	0.55
bigbench_logical_deduction_five_objects	0	multiple_choice_grade	30.80	±	2.07
bigbench_logical_deduction_seven_objects	0	multiple_choice_grade	22.86	±	1.59
bigbench_logical_deduction_three_objects	0	multiple_choice_grade	57.67	±	2.86
bigbench_movie_recommendation	0	multiple_choice_grade	62.00	±	2.17
bigbench_navigate	0	multiple_choice_grade	56.20	±	1.57
bigbench_reasoning_about_colored_objects	0	multiple_choice_grade	65.65	±	1.06
bigbench_ruin_names	0	multiple_choice_grade	64.73	±	2.26
bigbench_salient_translation_error_detection	0	multiple_choice_grade	17.33	±	1.20
bigbench_snarks	0	multiple_choice_grade	76.24	±	3.17
bigbench_sports_understanding	0	multiple_choice_grade	75.15	±	1.38
bigbench_temporal_sequences	0	multiple_choice_grade	48.90	±	1.58
bigbench_tracking_shuffled_objects_five_objects	0	multiple_choice_grade	22.32	±	1.18
bigbench_tracking_shuffled_objects_seven_objects	0	multiple_choice_grade	18.17	±	0.92
bigbench_tracking_shuffled_objects_three_objects	0	multiple_choice_grade	57.67	±	2.86

Average: 47.6%

Average score: 57.59%

🧩 Configuration

Inspired by theprofessor's config

merge_method: linear # use linear so we can include multiple models, albeit at a zero weight
parameters:
  weight: 1.0 # weight everything as 1 unless specified otherwise - linear with one model weighted at 1 is a no-op like passthrough
slices:
  - sources:
      - model: openchat/openchat-3.5-0106
        layer_range: [0, 1]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 1]
        parameters:
          weight: 0
  - sources:
      - model: bardsai/jaskier-7b-dpo-v6.1
        layer_range: [1, 10]
  - sources:
      - model: senseable/WestLake-7B-v2
        layer_range: [10, 20]
  - sources:
      - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
        layer_range: [20, 30]
  - sources:
      - model: paulml/OGNO-7B
        layer_range: [15, 25]
  - sources:
      - model: paulml/DPOB-INMTOB-7B
        layer_range: [22, 32]
  - sources:
      - model: mlabonne/AlphaMonarch-7B
        layer_range: [5, 15]
  - sources:
      - model: openchat/openchat-3.5-0106
        layer_range: [31, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [31, 32]
        parameters:
          weight: 0
dtype: float16
tokenizer_source: model:openchat/openchat-3.5-0106
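
To get a feel for the size of the stacked model, one can sum the widths of the `layer_range` entries in the configuration above. The sketch assumes mergekit treats `layer_range` as half-open (start included, end excluded); `slice_ranges` and `total_layers` are illustrative names, not part of mergekit.

```python
# layer_range values from the slices above, in order.
# Assumption: ranges are half-open, so (1, 10) selects layers 1 through 9.
slice_ranges = [
    (0, 1),    # openchat-3.5-0106 (OpenHermes-2.5 included at weight 0)
    (1, 10),   # jaskier-7b-dpo-v6.1
    (10, 20),  # WestLake-7B-v2
    (20, 30),  # Nous-Hermes-2-Mistral-7B-DPO
    (15, 25),  # OGNO-7B
    (22, 32),  # DPOB-INMTOB-7B
    (5, 15),   # AlphaMonarch-7B
    (31, 32),  # openchat-3.5-0106 (OpenHermes-2.5 included at weight 0)
]


def total_layers(ranges):
    """Number of transformer layers in the merged stack."""
    return sum(end - start for start, end in ranges)


print(total_layers(slice_ranges))  # 61
```

Under that assumption the merge stacks 61 layers, nearly double the 32 layers of each 7B source model, which is consistent with the 14B in the chatty-djinn-14B name.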

An example

User: Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have? Think step by step and the result MUST BE A NUMBER

djinn: In this family structure, we know there are 4 aka '1 + 3-brothers' (1 being initially for Sally) existing individuals who are also the siblings of the 3-brother-group. As each of the 3-brothers has 2-sisters each, that sums up to 6 total sister- roles. Since we already accounted for one of the sisters (assuming to be the oldest or in alphabetical order, as names were not given) to be part of the 3-brother group, we subtract one from the 6-sisters total. So, considering the provided information, Sally would have 6 - 1 = 5 remaining younger or alphabetically-wise- placed, biological-sisters. However, note that this problem statement might not perfectly align with real-life complexities, such as same-sex twins, adoption, etc., but within the given context, the solution is 5.

Remember, in more intricate or unclear family tree scenarios, it'd be best to directly ask for clarifications or consider additional data points. In short, with the given constraints, Sally has 5 known, direct, biological, and likely-in-age-order, younger/alphabetically-placed, sister-siblings.(To reiterate the core result in a concise manner: 5).

(Please excuse the verbosity, aiming for thoroughness and clarity.)

This response maintains the initial request's numerical focus while providing a detailed explanation to ensure understanding.

💻 Usage

# Notebook cell: install or upgrade the required libraries
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "mayacinka/djinn"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Format the conversation with the tokenizer's chat template
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Load the model in half precision and let accelerate place it on the available devices
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Sample up to 256 new tokens; generated_text holds the prompt followed by the completion
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	72.08
AI2 Reasoning Challenge (25-Shot)	70.39
HellaSwag (10-Shot)	86.45
MMLU (5-Shot)	64.40
TruthfulQA (0-shot)	67.57
Winogrande (5-shot)	83.11
GSM8k (5-shot)	60.58
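
The Avg. row is consistent with an unweighted mean of the six benchmark scores. A short sketch that recomputes it from the table values (the `leaderboard` and `leaderboard_average` names are illustrative):

```python
# Scores copied from the Open LLM Leaderboard table above.
leaderboard = {
    "AI2 Reasoning Challenge (25-Shot)": 70.39,
    "HellaSwag (10-Shot)": 86.45,
    "MMLU (5-Shot)": 64.40,
    "TruthfulQA (0-shot)": 67.57,
    "Winogrande (5-shot)": 83.11,
    "GSM8k (5-shot)": 60.58,
}


def leaderboard_average(scores):
    """Unweighted mean of all benchmark scores, rounded to two decimals."""
    return round(sum(scores.values()) / len(scores), 2)


print(leaderboard_average(leaderboard))
```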
