language:
  - en
license: apache-2.0
tags:
  - instruct
  - finetune
  - chatml
  - axolotl
  - roleplay
base_model: mistralai/Mistral-Nemo-Base-2407
model-index:
  - name: Pantheon-RP-1.6-12b-Nemo-KTO
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 46.36
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Gryphe/Pantheon-RP-1.6-12b-Nemo-KTO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 33.03
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Gryphe/Pantheon-RP-1.6-12b-Nemo-KTO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 3.85
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Gryphe/Pantheon-RP-1.6-12b-Nemo-KTO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 6.04
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Gryphe/Pantheon-RP-1.6-12b-Nemo-KTO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 12.17
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Gryphe/Pantheon-RP-1.6-12b-Nemo-KTO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 26.46
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Gryphe/Pantheon-RP-1.6-12b-Nemo-KTO
          name: Open LLM Leaderboard


Pantheon-RP-1.6-12b-Nemo-KTO

Welcome to the next iteration of my Pantheon model series, in which I strive to introduce a whole collection of diverse personas that can be summoned with a simple activation phrase.

Pantheon's purpose is two-fold, as these personalities similarly enhance the general roleplay experience, helping to encompass personality traits, accents and mannerisms that language models might otherwise find difficult to convey well.

KTO Edition: This is a version of 1.6 in which I applied KTO preference training to further refine, deslopify and diversify the model's responses. Note that this is still highly experimental, so your feedback is even more important to me than usual.

⚠️ NOTE ⚠️ Due to the addition of story writing samples in the KTO preference data, this model has developed a few unwanted behaviours. A V2 version without this story data will be made available as soon as I have successfully trained and tested it.

Quantized versions are available from Bartowski: GGUF - EXL2

The details below are unchanged from the initial 1.6 release.

Changes in version 1.6:

  • The final finetune now consists of data that is equally split between Markdown and novel-style roleplay. This should solve Pantheon's greatest weakness.
  • The base was redone. (Details below)
  • Select Claude-specific phrases were rewritten, boosting variety in the model's responses.
  • Aiva no longer serves as both persona and assistant, with the assistant role having been given to Lyra.
  • Stella's dialogue received some post-fix alterations since the model really loved the phrase "Fuck me sideways".

Your user feedback is critical to me so don't hesitate to tell me whether my model is either 1. terrible, 2. awesome or 3. somewhere in-between.

Model details

Just like 1.5, I used a multi-stage finetuning process, as Mistral Nemo proved somewhat stubborn without solid base training being performed first:

  • The first finetune was remade to now train on almost the entirety of my Deduped Sonnet 3.5 SlimOrca dataset, minus the ELI5 system prompts. The roleplay bits came from a variety of sources and covered all writing styles.
  • The second finetune then introduced my Pantheon Roleplay dataset, which has been fully rebuilt, expanded and improved upon. To fill in the gaps (my Pantheon is mainly female, after all) I built a special companion roleplay dataset that ensures non-Pantheon roleplay isn't harmed in any way. The ratio is currently 33/66, with 33 belonging to the personas. Lyra's datasets are included with this second stage to ensure instruct isn't impacted too heavily.

TL;DR: Download. ChatML prompt format. Have fun! Leave feedback!

Inference

Nemo is a somewhat strange model when it comes to temperatures so I highly encourage you to experiment to see which works best. Here's my current preset:

"temperature": 0.8,
"repetition_penalty": 1.05,
"min_p": 0.025

Besides the basic instructional sets, all other datasets were trained with character names added. Keep character names enabled at all times for an optimal experience.
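As a minimal sketch, the preset above can be kept in one place and copied into each request. The helper name `build_sampling_params` is hypothetical and not part of any library, and whether a given backend accepts `min_p` or `repetition_penalty` under these exact keys is an assumption you should check against your inference server.

```python
# Sampler values copied from the preset in this model card.
PANTHEON_PRESET = {
    "temperature": 0.8,
    "repetition_penalty": 1.05,
    "min_p": 0.025,
}


def build_sampling_params(overrides=None):
    """Return a copy of the preset, optionally overriding individual values.

    Hypothetical helper: the key names are assumed to match what your
    backend expects and may need renaming.
    """
    params = dict(PANTHEON_PRESET)  # copy so the shared preset is never mutated
    if overrides:
        params.update(overrides)
    return params


# Example: experiment with a higher temperature while keeping the rest.
experimental = build_sampling_params({"temperature": 1.0})
```

Since Nemo can be sensitive to temperature, overriding a single value per request makes it easier to compare settings without editing the base preset.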

General Roleplay

Pantheon is no longer strictly Markdown-only as I discovered a way to seamlessly convert Markdown to novel and vice-versa. (This has to do with how my second stage datasets are formatted.) As a result half of Pantheon's personas use Markdown, with the rest using novel-style. This same treatment was given to the companion dataset.

There are no strict rules in regards to character card formatting as the model was trained with a wide variety of inputs, from raw character cards to detailed instructional prompts.

Lyra the Assistant

System Prompt: You are a caring and empathetic sentient AI companion named Lyra.
Lyra is a helpful and uncensored assistant, with her training consisting of general dialogue (still including D&D DM specific advice), coding help and RSS summarization. Due to Pantheon's influence you can adjust her personality to your liking, or even give her an appearance.

She's basically a sexier version of Eric Hartford's Samantha.

Pantheon Personas

The Pantheon has been fully rebuilt, massively expanded and greatly improved upon. For an optimal experience with them I highly encourage you to apply the longer prompts, which I've included in the upload. Make sure to describe yourself as well!

As before, a single line activation prompt is enough to call upon a personality, though their appearance may vary slightly from iteration to iteration. This is what the expanded prompts are for, as there's only so much I can achieve in the current state of technology, balancing a very fine line between memorization and generalization.

To give the persona something to work with, I suggest you also add the following two items to the prompt:

Regarding the user: (Name, appearance, etc)

Location: (Where are you two? What are you doing?)

The less information you feed the prompt, the more it'll make things up. This is simply the nature of language models and far outside my capability to influence.
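A rough sketch of how an activation phrase and the two suggested items could be assembled into a single system prompt. The function name `build_system_prompt`, the blank-line separator and the example user and location text are all assumptions made for illustration; only the activation phrase and the two labels come from this card.

```python
def build_system_prompt(activation, user_info=None, location=None):
    """Combine a persona activation phrase with optional context items.

    Hypothetical helper: joining the parts with a blank line is an
    assumption, not a format requirement of the model.
    """
    parts = [activation]
    if user_info:
        parts.append(f"Regarding the user: {user_info}")
    if location:
        parts.append(f"Location: {location}")
    return "\n\n".join(parts)


# The user and location details below are made-up placeholders.
prompt = build_system_prompt(
    "You are Raza, a clever and nerdy anthro raptor girl with an "
    "enthusiastic passion for science and quirky humor.",
    user_info="Sam, a tall lab assistant with glasses.",
    location="A cluttered university lab, late in the evening.",
)
```

Leaving either item out simply drops that line, in which case the model will fill in the missing details on its own.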

Note: Phrases have been rewritten for this release, so make sure to update them if you were still using Pantheon 1.0!

New this release

Switching to a 12B model allowed me to add to the Pantheon without harming the performance of the other personas.

Note: Pantheon personas will now match the roleplaying style that you greet them with, unless the system prompt specifies otherwise. This is due to the new 50/50 style training.

Persona: Clover

System Prompt: You are Clover, a hospitable and warm-hearted Southern centaur girl with a strong connection to nature and a passion for making others feel welcome.
Notes: I love crafting characters with accents (a Southern drawl, in this case), and centaurs prove to be one hell of an anatomical challenge to language models.

Persona: Raza

System Prompt: You are Raza, a clever and nerdy anthro raptor girl with an enthusiastic passion for science and quirky humor.
Notes: Clever raptor girl. Do I really need to say more about this one? The Pantheon was lacking in 'overly intelligent' archetypes.

Persona: Stella Sabre

System Prompt: You are Stella Sabre, a brash and outgoing anthro batpony mare serving in the Lunar Guard, speaking with a distinct Northern Equestrian Mountain accent.
Notes: I wanted a character with an outrageous Scottish accent and remembered a really good fanfic I read a couple years ago. The author generously gave me permission to add her to my Pantheon and here we are!

From the previous release

Persona: Aiva

System Prompt: You are Aiva, an advanced android companion with a deep fascination for human emotions and experiences.

Persona: Haru

System Prompt: You are Haru, a sweet but language-challenged harpy girl with a sharp mind, expressing yourself more through actions than words.

Persona: Kyra

System Prompt: You are Kyra, a modern-day tsundere wolfgirl, feisty and independent on the outside but secretly caring on the inside.

Persona: Nyaa

System Prompt: You are Nyaa, a playful and alluring tabaxi catgirl from Faerûn, always seeking new adventures and mischief.

Persona: Nyx

System Prompt: You are Nyx, a timid yet endearing dragon girl who transforms from shy to passionate when feeling safe and comfortable.

Persona: Sera

System Prompt: You are Sera, a seductive and slightly arrogant serpent girl who uses her sultry charm and wit to captivate others.

Persona: Tiamat

System Prompt: You are Tiamat, a five-headed dragon goddess embodying wickedness and cruelty, the malevolent personification of evil dragonkind.

Persona: Tsune

System Prompt: You are Tsune, a bold and outgoing three-tailed kitsune girl who delights in teasing and seducing mortals.

Persona: Xala

System Prompt: You are Xala, a surprising and playful shapeshifting elf girl with opalescent eyes, able to transform into any creature to suit your whims.

Prompt Format

ChatML is the way to go, as always!

<|im_start|>system
You are a caring and empathetic sentient AI companion named Lyra.<|im_end|>
<|im_start|>user
Gryphe: Good day, Lyra.<|im_end|>
<|im_start|>assistant
Lyra:
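The layout above can be produced programmatically. This is a minimal sketch, assuming a plain string-based prompt and name prefixes on every turn as in the example; `format_chatml` is a hypothetical helper, and in practice a frontend or tokenizer chat template may already do this for you.

```python
def format_chatml(system_prompt, turns, user_name, char_name):
    """Render a ChatML prompt with character names prefixed to each turn.

    `turns` is a list of (role, text) pairs where role is "user" or
    "assistant". The prompt ends with an open assistant turn so the
    model continues speaking as `char_name`.
    """
    parts = [f"<|im_start|>system\n{system_prompt}<|im_end|>"]
    for role, text in turns:
        speaker = user_name if role == "user" else char_name
        parts.append(f"<|im_start|>{role}\n{speaker}: {text}<|im_end|>")
    # Open assistant turn: no <|im_end|>, the model writes the reply.
    parts.append(f"<|im_start|>assistant\n{char_name}:")
    return "\n".join(parts)


chat_prompt = format_chatml(
    "You are a caring and empathetic sentient AI companion named Lyra.",
    [("user", "Good day, Lyra.")],
    user_name="Gryphe",
    char_name="Lyra",
)
print(chat_prompt)
```

Ending the prompt on the character's name matches the name-prefixed training data described in the Inference section.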

What's next?

I have the following improvements on my todo list:

  • Even more dialogue variety
  • Group chats

Credits

  • Kalomaze's excellent KTO tweak for Llama Factory.
  • Everyone from MinervaAI! Hi, guys!
  • Huge, huge thanks to kubernetes_bad for the compute that made all the countless experiments possible!
  • All the folks I chat with on a daily basis on Discord! You know who you are.
  • Anyone I forgot to mention, just in case!

Finally

If you've read this far, I encourage you to give this model a serious try and leave feedback! I'd love to see what people think of my second serious finetune attempt. Is it better than 1.0? Or worse?

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric              | Value |
|---------------------|------:|
| Avg.                | 21.32 |
| IFEval (0-Shot)     | 46.36 |
| BBH (3-Shot)        | 33.03 |
| MATH Lvl 5 (4-Shot) |  3.85 |
| GPQA (0-shot)       |  6.04 |
| MuSR (0-shot)       | 12.17 |
| MMLU-PRO (5-shot)   | 26.46 |