Some notes on the models used.

Opened by ParasiticRogue

Hey! Haven't tried your merge yet, but I just wanted to share some notes I've found when doing merges in private.

First is about ChatAllInOne. Although it looks great on the surface, and the datasets used are kinda unique for Vicuna, the actual implementation leaves something to be desired... Every time I try to use it, whether as a 4th or 5th model at a low percentage, it tends to hallucinate wrong stuff more often. I have a scenario with a character card where I ask it about the character and the surrounding lore to test its ability to retain information, and more often than not it gets things wrong compared to when I just use the base three of Capybara, Bagel, and Tess, such as abilities not being consistent or even messing up clothes and colors. Plus it's one of those rambling models that doesn't know when to stop, despite me playing with numerous settings.

Thespis I haven't touched, but this is more just some general knowledge I picked up somewhere about word formatting with AIs when looking at its page. Apparently, since most written works use plain text for actions/events and wrap dialogue in "quotes", mixing in the internet-style RP of using asterisks and no quotes supposedly lessens the AI's pull from the traditionally written style in its data (or vice versa) when interacting. How much of an actual problem this is, I'm not sure, but I figured it was worth bringing to your attention if you weren't aware. It still might be a fine enough model to use though, idk.

Lastly, I think I know the reason you weren't able to use Nous Hermes before in mergekit, and my solution is to simply not use "union" when doing so. If you have to use union in a bigger merge, then the method I mentioned earlier when Capybara/Bagel were giving me problems should work: take the model and make a copy of it, then slerp the two together into a different base.
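Roughly speaking, I mean something like this for the non-union route (the model names, weights, and densities below are just placeholders to show where the tokenizer setting goes, not my actual recipe):

```yaml
# Placeholder sketch, not an actual recipe: the point is taking the
# tokenizer from the base model instead of unioning everything.
merge_method: ties
base_model: chargoddard/Yi-34B-200K-Llama
models:
  - model: NousResearch/Nous-Hermes-2-Yi-34B
    parameters:
      weight: 0.3
      density: 0.5
  - model: NousResearch/Nous-Capybara-34B
    parameters:
      weight: 0.7
      density: 0.5
tokenizer_source: base   # rather than "union"
dtype: bfloat16
```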

Anywho, good luck with the merging experiments!

mixing in the internet-style RP of using asterisks and no quotes supposedly lessens the AI's pull from the traditionally written style in its data

Yeah, I was afraid of this, and it's actually a big reason I wasn't satisfied with Thespis by itself. I wasn't really impressed with Thespis's internet-RP-format writing either, which seems to confirm your speculation. But alas... novel syntax seems to work just fine? Actually quite well, with good prose and formatting. shrug

As for ChatAllInOne, I had some good "assistant" style results from it, but I will take a closer look, thanks.

And as for Nous Hermes, its tokenizer config is actually bugged, see: https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B/discussions/5

Not sure how much that affects performance in a non-union merge, but in general I am wary of it for storywriting because it is a base Yi 4K finetune, not a 200K model. Perhaps it will be better slerped into a short-context "assistant" merge, hence I did use someone else's SUS-Hermes merge in the v8 megamerge.

Also forgot to credit you and your thread for the idea, fixed that!

You don't have to go that far, but thanks all the same!

Quick aside though, since I never understood why this is: what's the significance of placing the base model inside the merge itself with no parameters, as opposed to using it purely in the "base model" tab at the bottom? Is it just a formality for users looking at the page, or does it help mergekit run smoother when "base model" is set in the settings? I've been experimenting with not using chargod's Yi base at all to save VRAM when merging (using one of the other 5 main models as the base instead), and the merges seem to work fine without it, but I just wanted to double check and see if you have any wisdom on that front, in case it affects something deeper.

Interesting. The base model used to be in the example mergekit configs in methods that require it (SLERP, Ties), but it was removed a month ago!

https://github.com/arcee-ai/mergekit/commit/e52a51b35f1d76189e6b5dd7207754b9d04e2cac#diff-71de39261f847e7946144f85aeaf4dd6143e394fb4cde2879d1e83aaea89a6da

I missed that. I will remove it in my configs from now on.

IIRC it used to take the tokenizer/config from the base model parameter and the actual weights from the base model specified in models, but I guess that duplication isn't necessary anymore.

Generally you should specify the actual base model everything was trained on as the base model though. TIES/SLERP actually use it in their algorithms as a reference, and that reference may be off if you use a finetuned model instead.
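So going forward my configs will look more like this (models, weights, and densities here are placeholder values purely for illustration): the base appears exactly once, in base_model, and not in the models list with empty parameters.

```yaml
# Illustrative placeholder config: the pretrained base is named once in
# base_model (where TIES/SLERP use it as the reference) and is not
# duplicated in the models list.
merge_method: ties
base_model: chargoddard/Yi-34B-200K-Llama
models:
  - model: NousResearch/Nous-Capybara-34B
    parameters:
      weight: 0.5
      density: 0.6
  - model: jondurbin/bagel-dpo-34b-v0.2
    parameters:
      weight: 0.5
      density: 0.6
dtype: bfloat16
```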

Another random bit of trivia:

chargod's Yi base

This is Charles Goddard, aka the mergekit dev, by sheer coincidence!

He "llamafied" Yi a long time ago, but actually thats no longer necessary. You can use base Yi 200K if you want straight from the original repo, as 01.AI have now "llamafied" Yi. I just keep it around because I never bothered to redownload what is essentially the same model.

Crazy idea here, but would using an undertrained model like Capybara as the base net some positive results for a merge, considering we're both aiming for a Vicuna build? It would still retain some of the underlying features while keeping with the format, as suggested on your front page for this merge. To be clear, I already have two copies merged into chargod's base as well, so it should possess similar settings on that front.

Possibly. I'm trying to wrap my head around the math of using a non-base model as a base. I don't think it's as simple as diff merging; I will have to go back and read the TIES paper.

I actually did use Yi 4K as the "base model" for a merge with Bagel 34B (along with other models), but I'm not sure how well it worked.

Did some further testing with the base changes and, tbh, I didn't really care for the results. It seemed less stable compared to just using a proper base model for the merge. Oh well. I'll probably go back to the 4-way merge now, since I think I've exhausted most possibilities with this endeavor, and I haven't found another model that didn't make it worse in some regard or another when testing. Probably just gonna make a slight personal change with Tess's placement in the mergekit config, since I guess that plays a slight factor in the format, possibly making it prompt better with Orca-Vicuna? Anyway...

The only other thing I can offer is that using Pippa-Cleaned from Royallab for the quantized models seems to make them behave better for creative endeavors, at least compared to wikitext or that 20k random junk data, which can really make them have no brakes on their output, especially when using 8-bit cache.

using Pippa-Cleaned from Royallab for the quantized models seems to make them behave better for creative endeavors

On this, yeah, I quantized the GGUFs on story data (as well as a mix of some other stuff to normalize it), though I'm still not sure about the exl2s, since they are very sensitive at low bpw.

Tell me about it... I like using exl2, but man, does the Yi base not like it. I finally found some results on my own model that empirically show the 8-bit cache making the model dumber in my tests: asking it about historical figures and their dates, with the cache on it always cuts off the last digit, so 1948 becomes 194. Without the cache it doesn't do this, and I've tested different presets to make sure it wasn't a settings issue, and even different models, yours included, which uses a different parquet entirely.

But... now that I know this, I tried going lower in size (3.5 bpw) to keep the larger context and not have the model be a dumbo with the cache on, but then it just completely breaks using Pippa at the lower size. Wonderful... Yours at least works fine, so now I'm experimenting with different datasets. I've found one that seems interesting to use with Yi, having a diverse/random enough format in wordplay that it won't get stuck on just one type of sentence structure like wikitext's info or the typical novel "once upon a time" bs, while being both English and Chinese for potentially better IQ, seeing as the GitHub forums said using multiple languages (if the base was trained on them) might help with that.

https://huggingface.co/datasets/magicsword/train-en-zh?not-for-all-audiences=true

It seems like a good compromise with the 20k junk data while still retaining some structure. Maybe getting rid of the lines in the middle might help, but I'll get back to you if this turns out to be totally useless.

I am interested to hear it. You are ahead of my exl2 dataset research!

My tentative plan is to mix random tokens, the default exl2 data (minus stuff that's not needed, like coding), Vicuna-format chats, and story-like data, so it will hopefully normalize between them.

Then, make a "short" quantization at the default 8K context, and a "long" quantization at 32K, which seems to destroy short-context performance while helping at 16K+.

But TBH I may just start using GGUFs once llama.cpp gets flash attention and quantum (quantized) cache itself.

LLMs are wacky... For some reason, updating textgen and SillyTavern stopped the incomplete-number problem in testing, even with the exact same settings. And lower-bit versions that didn't work before do now as well. There has to be some degradation when using the cache, but now I don't notice it as much, so maybe going back up in bits is fine enough for quality? idk.

Also, the parquet I used with min-p that I thought was the best, well... it breaks when using the new smoothing factor in settings, so I had to find something else. I tested 10 different datasets, and the en-zh dataset worked in testing, but it wasn't the best. Using a dataset that is already configured for the model's preferred prompt format does indeed make it behave better. Despite my misgivings with its use as a model, the Chat-All-In-One parquets give the least amount of garbage when doing 20 swipes, with only 3 instances of it spewing the user/char profile at the end. However, I'll probably stick with this next one for now, since the only thing it got wrong was sometimes using asterisks instead of plain text (also only 3 out of 20 at 3.5 bpw), and it might phase that out with more examples as the chat progresses.

https://huggingface.co/datasets/pvduy/sharegpt_alpaca_oa_vicuna_format?row=92

It has multiturn data, and the configuration keeps it clean with ASSISTANT: and USER:, with even the special separator token at the end as well. Datasets that get complicated without proper formatting, like "Context, "role": "assistant"", are worse off. So when making your dataset for lower-bit models, especially if you plan to use it for models of the Vicuna mold, I'd highly recommend you format it in that fashion, maybe even using SYSTEM: if there is a prompt at the beginning too.
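To be concrete, I mean rows that boil down to something like this (the content is just a made-up example of the shape):

```
SYSTEM: A chat between a curious user and an assistant. The assistant gives helpful, detailed answers to the user's questions.
USER: Who wrote The Count of Monte Cristo?
ASSISTANT: Alexandre Dumas; it was serialized between 1844 and 1846.</s>
USER: Summarize it in one sentence.
ASSISTANT: Edmond Dantès is wrongfully imprisoned, escapes, and methodically takes revenge on those who betrayed him.</s>
```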

In unrelated news, fresh Bagel models are out, and Sao10K is interested in bringing Fimbulvetr to Yi. Kinda happy, but also kinda pissed, since it means I have to go back to the drawing board with percentages once everything is out, LOL! And here I had finally found a good balance using their previous versions, thinking it was ready to be posted, but got sidetracked with EXL2 and this whole ordeal. Ah well... I'd rather hold out and relax till then.

By the by, since I don't use them: isn't GGUF still kinda borked with Yi models? They still don't do as well as the 7B models on the RP rankings, despite what I've seen to the contrary, and GGUF is what's used there when testing, which leaves me skeptical of the format when used with Yi, versus EXL2.

Don't mean to shill, but I have a Vicuna-format dataset of Bluemoon that finally made it stop spewing junk with the right settings, if you are interested in including it with your stuff later for lower quantizations.

https://huggingface.co/datasets/ParasiticRogue/Bluemoon-vicuna

Figured this would work better than Pippa just because it uses "quotes" over asterisks in its format. I tried to cut down on anything that wasn't creative-related, like editor's notes, titles, "this is an RP with me and Joe!", name/age/height profiles, time skips, random characters like ~~~ to indicate scene changes, etc. I also moved some franchises around that kept popping up too much so it wouldn't get too focused on one subject in a row, pushing that stuff to the back of the dataset. Not 100% cleaned, as I was doing this by eyeballing stuff that happened to appear, but I've been happy with it thus far. Have a good one!

That's literally perfect, thanks. You're making me really want to start a novel-format finetune now, along with some others... I need to find an MI300X and see if FA2 works yet, lol.

Also, I dunno if something is still up with GGUF, but Yi seems to be doing OK in the new test:

https://ayumi.m8geil.de/erp4_chatlogs/#!/index
