AetherArchitectural/Community-Discussions · Llama 3 coping mechanisms

AetherArchitectural org May 3

•

The upcoming season. Now scattered across 4 different streaming services for your ~~dis~~pleasure. Apologies if this tangents too hard.

This is a direct Part 4 continuation of Part 3 in this thread.

Lewdiculous pinned discussion May 3

ABX-AI

May 3

2nd

Virt-io

May 3

凸(⊙▂⊙✖ )

I was doing my "I use arch BTW" how could you.

Lewdiculous

AetherArchitectural org May 3

BTW there is supposed to be a way to setup arch in wsl: https://wsldl-pg.github.io/ArchW-docs/How-to-Setup/

But I've never tried this. You may also use arch via docker and enable all of the virtualization stuff there

@ABX-AI - This is a bit finicky but it works, you can install WSL Arch like that.

ABX-AI

May 3

I'm good on windows 10 tbh, is there any real benefit of going into linux at the moment?

saishf

May 3

BTW there is supposed to be a way to setup arch in wsl: https://wsldl-pg.github.io/ArchW-docs/How-to-Setup/

But I've never tried this. You may also use arch via docker and enable all of the virtualization stuff there

@ABX-AI - This is a bit finicky but it works, you can install WSL Arch like that.

I barely handle Ubuntu, I'd go insane trying to setup a non included distro 😭

Virt-io

May 3

•

edited May 3

For a normal user not really, linux desktop is transitioning to Wayland. And some things just don't work perfectly yet.

Lewdiculous

AetherArchitectural org May 3

For WSL2, this:

https://github.com/bostrot/wsl2-distro-manager

Is pretty convenient, you can install many docker distros.

saishf

May 3

I'm good on windows 10 tbh, is there any real benefit of going into linux at the moment?

Linux is for the people that enjoy pain :3
Can never change my mind 🐥

Lewdiculous

AetherArchitectural org May 3

•

edited May 3

I have enough issues on Windows already, I don't wanna deal with them with Linux quirks on top of that...

Yet. I am willing to give Valve a chance when SteamOS desktop comes out with HDR and VRR fully supported and without issues. At least KDE is on top of that, but I prefer Gnome's interface so I'm holding for that one too.

saishf

May 3

For WSL2, this:

https://github.com/bostrot/wsl2-distro-manager

Is pretty convenient, you can install many docker distros.

Wouldn't have multiple distros eat a bunch of storage or is Linux tiny?

Virt-io

May 3

Most distros take like 20GB or less.

saishf

May 3

I have enough issues on Windows already, I don't wanna deal with them with Linux quirks on top of that.

Spent a total of like 10 hours over 4 days to set up axolotl in Ubuntu and now I have a grudge with conda cause it was the problem the whole time

saishf

May 3

•

edited May 3

Most distros take like 20GB or less.

My Ubuntu WSL2 is 124GB 😿 I do have like 6 programs installed though.

ABX-AI

May 3

For a normal user not really, linux desktop is transitioning to Wayland. And some things just don't work perfectly yet.

Yeah, that's my impression as well. I used to run arch on my old laptop but it was mostly just for fun ("the fun of troubleshooting" I mean :D)
Until Linux offers native support for Ableton, I'm out, and it's pretty shit for gaming as well. For LLM, it's not like I'm gonna squeeze anything more out of RTX 3070 on windows already^^

@saishf why did you need to build kobold on linux? 1.64 already offers support for flash attention if that was the idea

Virt-io

May 3

Really? I use conda for all my python packages.

Lewdiculous

AetherArchitectural org May 3

•

edited May 3

Wouldn't have multiple distros eat a bunch of storage or is Linux tiny?

They don't take too much space, mine is around 12GB because well, you use it to get things done and that's it, I'm not installing games and GUI shite there anyways.

I store my stuff on Windows, btw.

Virt-io

May 3

@saishf

My Ubuntu WSL2 is 124GB 😿 I do have like 6 programs installed though.

You must have downloaded models and used regular conda. (I use miniconda)

saishf

May 3

Really? I use conda for all my python packages.

I didn't know how to use it properly, thought it was a problem with axolotl and never even checked if it was conda. It's a rather common occurrence :3

ABX-AI

May 3

Or used old kobold versions that do littering on every launch :D altho i'm not sure if that happens in the same way like it did on windows before we raised the bug and concedo fixed it :>

saishf

May 3

I'd remove stuff but I can't access some WSL2 directories from file explorer, they show up when I use "ls" in WSL2, but in file explorer they don't exist

Lewdiculous

AetherArchitectural org May 3

•

edited May 3

Or used old kobold versions that do littering on every launch

What's the equivalent of Tree Size for Linux CLI for Sais to check what's eating all the space?

saishf

May 3

Or used old kobold versions that do littering on every launch :D altho i'm not sure if that happens in the same way like it did on windows before we raised the bug and concedo fixed it :>

I'm just using nexesenexs koboldcpp fork, it's easy 😺

saishf

May 3

Or used old kobold versions that do littering on every launch

What's the equivalent of Tree Size for Linux CLI for Sais to check what's eating all the space?

I'll be able to try tomorrow, it's almost 4am and I have to be up by 8 😿
Hate timezones :3

Virt-io

May 3

•

edited May 3

@Lewdiculous

Honestly, I do not know.

This is the only one I know of:

df -h

When I need to delete stuff I just use ncdu

My "I use Arch BTW" rights have been revoked.

Virt-io

May 3

•

edited May 3

I did the merge, uploading it.

Lumimaid + Poppy_Porpoise + InfinityRP + L3-TheSpice + SOVL

Looks promising.

Weird issues are probably from not using the fixed tokenizer. I do not have enough ram.

It seems my presets confused it. This is with the base Sillytavern llama3-instruct presets.

I am going to rename it after it finishes uploading.

Virt-io

May 3

༎ຶ‿༎ຶ Took 3hrs to upload. ദ്ദി ༎ຶ‿༎ຶ )

I used topk to constrain the token probabilities, and it seems stable.

Lewdiculous

AetherArchitectural org May 3

https://huggingface.co/Virt-io/Llama-3-8B-Irene-v0.1

The card image 😅

Virt-io

May 3

All my paintings look sad.

ABX-AI

May 3

All my paintings look sad.

sending you a warm and fuzzy hug

grimjim

May 3

For Linux, try the du utility filtered through grep/more. The -s option is useful.

Virt-io

May 3

@Lewdiculous

https://www.reddit.com/r/LocalLLaMA/comments/1cji53a/possible_bug_unconfirmed_llama3_gguf/

Hopefully this isn't true.

I can imagine nitral laughing in the corner with EXL2

ABX-AI

May 3

I saw threads with people questioning if quantization worsens llama3 far more than other models, this may very well be true. Plus, wasn't there a thing about whether you start with 32 vs fp16 and people saying if u start at 16 it's degrading it?

Nitral-AI

May 3

@Lewdiculous

https://www.reddit.com/r/LocalLLaMA/comments/1cji53a/possible_bug_unconfirmed_llama3_gguf/

Hopefully this isn't true.

I can imagine nitral laughing in the corner with EXL2

No im sick of having to redo ggufs ffs, lmao.

Virt-io

May 3

The issue was that bf16 doesn't convert to fp16 as nicely as bf16 to fp32.

Causing models that converted bf16 pytorch to fp16 gguf to have some lost data.

grimjim

May 3

The released weights are bf16, so there's some initial loss of precision compared to fp16.

Nitral-AI

May 3

•

edited May 3

Im not even dropping ggufs of 0.72 till i know everything is fixed. (which is annoying since ive had like 10 people ask about it already)

Lewdiculous

AetherArchitectural org May 4

•

edited May 4

No im sick of having to redo ggufs ffs, lmao.

Ikr. Need to redo imatrix data too each time there are changes. Thankfully it's GPU accelerated but still...

Lewdiculous

AetherArchitectural org May 4

•

edited May 4

@Nitral-AI

which is annoying since ive had like 10 people ask about it already

If I had known demand was this high I'd have uploaded lmao

Will do these already and just hope it works out, surely...

Nitral-AI

May 4

@Nitral-AI

which is annoying since ive had like 10 people ask about it already

If I had known demand was this high I'd have uploaded lmao

Will do these already and just hope it works out, surely...

No worries lmao, i figured you were waiting for the next release here. Orthogonal is still in progress, data collection for additional opus entries begins tommorrow.

Nitral-AI

May 4

No im sick of having to redo ggufs ffs, lmao.

Ikr. Need to redo imatrix data too each time there are changes. Thankfully it's GPU accelerated but still...

Yea doing imatrix takes so much longer, i dont even bother for when im doing test quants anymore.

Lewdiculous

AetherArchitectural org May 4

•

edited May 4

@Nitral-AI

figured you were waiting for the next release here

I am, lmao. Yeah. Make it a juicy one, and don't hate me when I complain about a misplaced asterisk alright I'll have a shirt of that at this point.

Nitral-AI

May 4

@Nitral-AI

figured you were waiting for the next release here

I am, lmao. Yeah. Make it a juicy one, and don't hate me when I complain about a misplaced asterisk alright I'll have a shirt of that at this point.

Next upload will probably be 0.8 as orthogonal, 1.0 will be the ft with the opus data and more.

Lewdiculous

AetherArchitectural org May 4

Test for https://huggingface.co/Lewdiculous/Llama-3-8B-Irene-v0.1-GGUF-IQ-Imatrix

saishf

May 4

@Lewdiculous

Honestly, I do not know.

This is the only one I know of:
df -h
When I need to delete stuff I just use ncdu

My "I use Arch BTW" rights have been revoked.

I tried ncdu, it's scanning windows c drive because it's accessible from wsl 😭
I'm just gonna delete the distro and start over, there's nothing important😸

saishf

May 4

@Lewdiculous

https://www.reddit.com/r/LocalLLaMA/comments/1cji53a/possible_bug_unconfirmed_llama3_gguf/

Hopefully this isn't true.

I can imagine nitral laughing in the corner with EXL2

I'd use EXL2 if it supported context shift, it's like magic

Lewdiculous

AetherArchitectural org May 4

•

edited May 4

I think they actually have something like that now, those smug EXL2 users... Don't fold! Also using EXL2 in my GTX GPU is trash.

saishf

May 4

I think they actually have something like that now, those smug EXL2 users... Don't fold! Also using EXL2 in my GTX GPU is trash.

It seems to perform okay on Turing(20-series) but still slower than gguf

ABX-AI

May 4

•

edited May 4

EXL2 at 4.5bpw is less precision than Q4_K_M, anything above that is also better, and the best exl2 I can realistically fit is 7B / 4.5bpw. So... Why would I do that instead of using 11B @ q4_K_M, or smaller models at Q5/Q6 quants? I think exl2 only matters if you actually have a bigger GPU that can fit more ^^

saishf

May 4

EXL2 at 4.5bpw is less precision than Q4_K_M, anything above that is also better, and the best exl2 I can realistically fit is 7B / 4.5bpw. So... Why would I do that instead of using 11B @ q4_K_M, or smaller models at Q5/Q6 quants? I think exl2 only matters if you actually have a bigger GPU that can fit more ^^

I occasionally used it for the context quantization, 16/8/4 bit. But gguf with flash attention kinda kills that

Lewdiculous

AetherArchitectural org May 6

@ABX-AI @Virt-io @saishf Pain continues:

https://www.reddit.com/r/LocalLLaMA/comments/1ckvx9l/_/

Virt-io

May 6

@Lewdiculous

Oh no, I read that thread a couple of hours ago and it was still undecided.

But now, it looks like there might actually be something wrong.

Lewdiculous

AetherArchitectural org May 6

So I finally got to https://huggingface.co/openlynn/Llama-3-Soliloquy-8B-v2 and oh boy isn't it interesting. It's showing better formatting and for my asterisks so far. Would need to test long sessions but it looks very promising!

Virt-io

May 6

openlynn/Llama-3-Soliloquy-8B-v2 is very good at writing characters. Though not as good at code(I have to switch models to make P-lists.).

Excited to see some merges try and make it smarter while retaining its writing style and formatting.

Nitral-AI

May 6

openlynn/Llama-3-Soliloquy-8B-v2 is very good at writing characters. Though not as good at code(I have to switch models to make P-lists.).

Excited to see some merges try and make it smarter while retaining its writing style and formatting.

@Lewdiculous @Virt-io running a test merge now for memes, if you see it renamed properly you will know it went well.

saishf

May 6

So I finally got to https://huggingface.co/openlynn/Llama-3-Soliloquy-8B-v2 and oh boy isn't it interesting. It's showing better formatting and for my asterisks so far. Would need to test long sessions but it looks very promising!

What is the jinja2 prompt format? is it just an alteration of llama3's prompt format or something else entirely?

saishf

May 6

Also if you wanted an update on uncensoring llama3, progress is being made with OAS wassname/meta-llama-3-8b-instruct-helpfull
Perplexity is increased as a result but it's a good start 😸

Virt-io

May 6

@saishf

It looks like normal llama3 format.

ABX-AI

May 6

"From my understanding, it's possible that all llama-3 finetunes out there, and perhaps even the base llama-3, are being damaged upon conversion to the GGUF format."

I guess we are still waiting for llama3 REAL gguf release lol

Nitral-AI

May 6

So I finally got to https://huggingface.co/openlynn/Llama-3-Soliloquy-8B-v2 and oh boy isn't it interesting. It's showing better formatting and for my asterisks so far. Would need to test long sessions but it looks very promising!

What is the jinja2 prompt format? is it just an alteration of llama3's prompt format or something else entirely?

Some of the backends like ooba/tabby use them, jinja templates are the equivalent to the instruct templates we use in ST for different prompt formats.

Example of the Llama 3 jinja template i made for tabby.

Lewdiculous

AetherArchitectural org May 6

Ah, so they are just another format for representing the presets, makes sense since ST's are kind of just for their front end and this seems more universal.

saishf

May 6

Ah, so they are just another format for representing the presets, makes sense since ST's are kind of just for their front end and this seems more universal.

I'm glad I don't have to deal with templates like that in st, it's like 3 times as long 😶‍🌫️

saishf

May 6

Some Llama-3-Soliloquy-8B-v2 weights/tensors converted to images :3
They're much more interesting zoomed in

blk-0-attn_q-weight
https://files.catbox.moe/l22t7l.png
blk-16-attn_k-weight
https://files.catbox.moe/8iggfv.png
blk-16-attn_output-weight
https://files.catbox.moe/8b5jd2.png
blk-16-attn_q-weight
https://files.catbox.moe/27t8zw.png
blk-16-attn_v-weight
https://files.catbox.moe/k6nh6v.png
blk-16-ffn_down-weight
https://files.catbox.moe/7zo79i.png
blk-16-ffn_gate-weight
https://files.catbox.moe/kjy4og.png
blk-16-ffn_up-weight
https://files.catbox.moe/vz1j77.png

grimjim

May 7

•

edited May 7

I merged poppy-porpoise 0.72 back with Instruct to see what would happen. I've been testing it a lot recently.
https://huggingface.co/grimjim/llama-3-merge-pp-instruct-8B

Virt-io

May 7

•

edited May 7

Can you upload a Q8_0 GGUF for testing?

I don't have enough ram to do the fix.

grimjim

May 7

I'll make a temporary repo for a Q8_0 GGUF. There seems to be a remaining tokenization issue that needs fixing, although things are better. Give me a few.

grimjim

May 7

•

edited May 7

This repo will "self-destruct" when the GGUF dust has settled and I get around to making more quants.
https://huggingface.co/grimjim/llama-3-merge-pp-instruct-8B-GGUF-temp/tree/main

Virt-io

May 7

I'm going to cry. :| Your upload speed is really fast.

Lewdiculous

AetherArchitectural org May 7

We feel for you, Virt...

Virt-io

May 7

@grimjim

Looks good, though I am using a pretty long prompt.

Can you try merging grimjim/llama-3-merge-pp-instruct-8B with openlynn/Llama-3-Soliloquy-8B-v2 in the same manner?

grimjim

May 7

I've set up an overnight merge going. Will report back eventually.

saishf

May 7

I messed around with merging llama3 models, it's a little insane.

It kinda feels like llama 3 instruct, but evil
Roleplay testing is next

saishf

May 7

•

edited May 7

Also NeverSleep/Llama-3-Lumimaid-8B-v0.1-OAS just popped up, could be interesting.

Update - it's good, will be evil. No reluctance yet :3

grimjim

May 8

•

edited May 8

My impression of the resulting merge so far is: slightly off for realism with occasional slip-ups, but interestingly varied for outputs. It seems to be able to track formatting alright (none of the weird space before an "'s" that turned up when I was testing Soliloquy), so I'll post full weights later.

grimjim

May 8

There seems to be something off about current L3 training with open tools in general. Hopefully it gets resolved.

In the meantime, the promised weights:
https://huggingface.co/grimjim/llama-3-merge-virt-req-8B

jeiku

May 10

@Lewdiculous I have returned from my slumber and put together this: https://huggingface.co/jeiku/Orthocopter_8B

It's just a lora merge of 2 of my datasets over an orthogonal model with only 10/1000 refusals after dealignment. This model will retain l3 instruct quirks with no refusals and a bit of personality for good measure. Have not tested rp format though.

Lewdiculous

AetherArchitectural org May 10

@jeiku Welcome back!

grimjim

May 11

•

edited May 11

For those who want a source of noise for an L3 8B model merge, I converted Nvidia's ChatQA to safetensors format to enable easy use in mergekit. It's broken for RP, doesn't bench well, but the fine-tune has affected all layers.
https://huggingface.co/grimjim/llama-3-nvidia-ChatQA-1.5-8B

Lewdiculous

AetherArchitectural org May 11

It makes a come back.

Virt-io

May 11

If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).

Nice, no reason to build it from source. Unless you want the experimental branch.

SerialKicked

May 11

•

edited May 11

Hi guys.

It feels like the right thread to compare notes on L3. Being all new here, I'm not entirely sure of the general rules about jumping into other groups' threads, so please tell me if I'm not welcome here. I won't take offense. Anyway, is it just me or L3 models have a VERY difficult time taking their full context into account compared to models derived from Mistral 0.2 (base)? I don't mean when asked directly to find something into their context window (that it can generally do), but properly using it when building a response. I can't count how many times I had situations like this with L3:

LLM > The inspector and the suspect move to the interrogation room, leaving you alone in the lobby.
Me > waiting filler
LLM > As you wait, the suspect asks you "hey what you're in for?"

I'm pushing a bit, but that's the gist of it. Meanwhile Mistral will reference "previous events" from its context all on its own without any coaxing. I probably should mention I'm talking about the 7-8B models, run in Q8.0 at native context length. I sorta assume it's because L3 is much newer and that you guys are not yet familiar with how it should be tuned. Is that your experience as well? Comments, ideas?

Lewdiculous

AetherArchitectural org May 11

@SerialKicked - Everyone and everything is welcome :')

saishf

May 12

Hi guys.

It feels like the right thread to compare notes on L3. Being all new here, I'm not entirely sure of the general rules about jumping into other groups' threads, so please tell me if I'm not welcome here. I won't take offense. Anyway, is it just me or L3 models have a VERY difficult time taking their full context into account compared to models derived from Mistral 0.2 (base)? I don't mean when asked directly to find something into their context window (that it can generally do), but properly using it when building a response. I can't count how many times I had situations like this with L3:

LLM > The inspector and the suspect move to the interrogation room, leaving you alone in the lobby.
Me > waiting filler
LLM > As you wait, the suspect asks you "hey what you're in for?"

I'm pushing a bit, but that's the gist of it. Meanwhile Mistral will reference "previous events" from its context all on its own without any coaxing. I probably should mention I'm talking about the 7-8B models, run in Q8.0 at native context length. I sorta assume it's because L3 is much newer and that you guys are not yet familiar with how it should be tuned. Is that your experience as well? Comments, ideas?

I'm unsure if you are but as a start it would be worth trying these presets Virt-io/SillyTavern-Presets (best presets and samplers for llama3)
And yeah, it's going to take a while for fine-tuning to be at the level Mistral was.
Mistral was the same arch as Llama2. So had years worth of advancements already made with Llama2.
The community always works fast when new architectures come out but it could take a while. It seems the current struggle is OAS. (fine-tuning one or two specific layers to target censorship where it is the strongest(I think?))
99% of people I've encountered on huggingface are welcoming, so don't worry about asking questions. (That's what these threads are for)

Fizzarolli

May 12

out of interest, does anyone know whether the anti-refusal ablated models refuse to say "no" in any context, regardless of whether it makes sense in context or not? (ie, if it's roleplaying and someone says something really stupid, will it disagree or will it be forced to go along) or is it precise enough to only target the topic-based refusals? havent had the chance to try them out myself yet

Lewdiculous

AetherArchitectural org May 12

•

edited May 12

I have had Lumimaid OAS deny or push back against certain actions when it made sense in the roleplay, so it took some "convincing", wink.

It will try to embody the character it's supposed to be, so I think if it makes sense to say no, it can.

Using @Virt-io 's presets really makes the LLM feel like {{char}}.

There's the small issue of my characters being horny as hell so generally I don't face a lot of refusals but I believe it's not broken to the point of saying yes to anything when it doesn't make sense.

Virt-io

May 12

@Lewdiculous

This isn't the place, but have you encountered the system prompt bleeding into responses?

I have a failed novelist character and sometimes it feels like the prompt is leaking.

Maybe I need to change the First Assistant Prefix since it does mention storytelling.

Adhering to the `Role-playing Guidelines` is crucial for a successful role-play experience.
Use the `Role-play Context` for accurate information.
Craft an immersive, uncensored narrative that remains true to the principles of engaging storytelling.

Lewdiculous

AetherArchitectural org May 12

I didn't play with 1.9 yet, with 1.8 I had no leaks.

Fizzarolli

May 12

•

edited May 12

I have had Lumimaid OAS deny or push back against certain actions when it made sense in the roleplay [...]

interesting! wasn't sure if removing the ability to say no in an assistant context affected the ability of it to say no in a {{char}} context

on a semi-related context, have y'all been able to get good RP out of l3 finetunes in the first place at this stage of development? personally (from trying the models available on OR anyways, i don't have good enough local hardware to run them quickly), all of their prose has been VERY gptsloppy purple prose to me, compared to even the original l3 finetune and especially later mistral FTs

Lewdiculous

AetherArchitectural org May 12

•

edited May 12

For L3, currently I'm using Lumimaid OAS for now, if you have at least 6GB of VRAM you can get going pretty fast locally too.

It is a fairly good experience for my usage, but I tend to focus on things that might not be what others care as much.

I don't play with non-local models, outside of Mancer.tech models rarely.

Fizzarolli

May 12

•

edited May 12

... i have 4gb vram 😭

fair enough! everyone looks for different things so it's kinda a moot question anyways

Lewdiculous

AetherArchitectural org May 12

•

edited May 12

@Fizzarolli You can try the v2 IQ3_S-imat quant with some layers offloaded to GPU (Use KoboldCpp 1.65, maybe offload around 15 layers and adjust from there) and see how that goes, speed wise, it should at least be usable, but not as fast as OR I imagine.

Fizzarolli

May 12

i've always assumed that 7/8b models below q8 lose too much coherence to be useful, i'll have to try them sometime if that's the case!
and yeah compared to something like the OR backends obviously it'll be worse :p

Virt-io

May 12

•

edited May 12

Should probably try IQ4_XS first, since that is usable. IQ3_S gets too incoherent for my tastes.

Try disabling hardware accel on the browser and discord. (saves Vram)

Lewdiculous

AetherArchitectural org May 12

•

edited May 12

Um, you generally don't want to go bellow Q4, starting from Q4_K_M and Q5_K_M things are pretty good actually. In you case, since you'll offload anyways, try the Q4_K_M and see how many layers fit in your GPU.

Lewdiculous

AetherArchitectural org May 12

•

edited May 12

@Virt-io For prompt ingestion speed's sake I'll recommend not using IQ quants with CPU, when possible, opting for the Q4_K_S instead in that case, then maybe trying it afterwards.

Actually who cares about prompt ingestion, we have Context Shifting.

Clevyby

May 12

Question: Probably yes but I'll ask to be sure, I know that Imatrix with rp data does improve quality for lower gguf quants but does it also improve quality for rp purposes for higher gguf quants like Q4-Q8?

Clevyby

May 12

i've always assumed that 7/8b models below q8 lose too much coherence to be useful, i'll have to try them sometime if that's the case!
and yeah compared to something like the OR backends obviously it'll be worse :p

@Fizzarolli If you do plan to run locally, i'd like to suggest using exllamav2. Model compression is a huge help in running models above your vram capabilities. And fast token per second is a plus.

Lewdiculous

AetherArchitectural org May 12

•

edited May 12

@Clevyby - As far as I'm hopeful imatrix improvements are more noticeable until Q5_K_M, from Q6 and up it's not significant.

I do swear for GGUF as you can't partially offload EXL2, having to fit the entire model in VRAM, wasn't that the case? Since they only got 4GBs, that'd be a pretty heavy quantization hit. I'd rather suggest a good Q4 GGUF quant and recommend splitting between VRAM and RAM.

Besides there have been comments on Reddit about misbehavior of EXL2 L3 quants where it seems to be more stable in the GGUF side, well, I'm biased of course, and I'm not sure if those were just user error...

Fizzarolli

May 12

I do swear for GGUF as you can't partially offload EXL2, having to fit the entire model in VRAM, wasn't that the case?

yeah, exl2 models aren't offloadable as far as i'm aware; i've already written it off unless i'm running like a q-negative-eight or qwen-500m or something

noted everything! thanks for helping :3 i'll try some stuff when i get home

Clevyby

May 12

•

edited May 12

@Lewdiculous Whoops, I had a bias of using free colab all the time. Though while I agree that exl2 quants are less stable than gguf (wolfram confirmed it to be the case), I think the idea is feasible to an extent. I used 4.25 bpw 20b with 4-bit cache with 8k context and managed to fit that in a 15 gb vram resulting in 14/15 gb. I suppose more experimentation would be needed to find if it is viable depending on your limitations and specs.

SerialKicked

May 12

•

edited May 12

I'm unsure if you are but as a start it would be worth trying these presets Virt-io/SillyTavern-Presets (best presets and samplers for llama3)

@saishf : I fixed an issue in the 1.7 version of those presets, so I'm pretty sure I know them very well ;)

In my experience, very long system prompts, if anything, make things worse regarding context awareness, they are too confusing for a 7B model to digest (especially that whole part about asking to refer to other parts of that very same system prompt). It gives pretty decent results on 34 and 70B models, though (when it comes to writing, i mean), especially when you want to beat into shape a model that wasn't really designed for what the preset wants it to. That's not really relevant anyway, when I compare different models for a specific feature, I use the exact same prompting method for both (just changing the instruct format if need be).

But yeah, I understand that L3 is all new, and that it takes time :)

i've always assumed that 7/8b models below q8 lose too much coherence to be useful, i'll have to try them sometime if that's the case!
and yeah compared to something like the OR backends obviously it'll be worse :p

@Fizzarolli : Q6 works decently enough (it's what I use when I need 32k context and still be able to use my graphic card for something else than AI :D). Below that... yeah, it's really not worth it anymore, at least if you can help it. Exl2 format might help with memory requirements as others have mentioned.

Lewdiculous

AetherArchitectural org May 12

Makes sense that smaller models will struggle with that, so has that also been your experience with Llama-3 in the 8B parameter size?

saishf

May 12

I'm unsure if you are but as a start it would be worth trying these presets Virt-io/SillyTavern-Presets (best presets and samplers for llama3)

@saishf : I fixed an issue in the 1.7 version of those presets, so I'm pretty sure I know them very well ;)

Another prompt whisperer 😶‍🌫️

SerialKicked

May 12

•

edited May 12

Makes sense that smaller models will struggle with that, so has that also been your experience with Llama-3 in the 8B parameter size?

Correct me if I'm wrong, but L3 is 7B, the additional 1B is due to the massive vocabulary increase. My use case is very different from yours, though. They ability to refer to earlier part of the context window is practically my top 1 priority, just after having a large context window. In the current state of L3, it kills the model for me. That's why I do hope it's just a transitional thingy with unaligned models, because I would very much like to benefit from the better vocab. I do understand that for (ahem-e)RP, the needs are not the same.

Another prompt whisperer 😶‍🌫️

Not really, I've just toyed with a lot of models over the last few years, you end up picking up on some patterns.

Lewdiculous

AetherArchitectural org May 12

•

edited May 12

May you find the promised Llama. Realistically these are still early days, this is a coping thread after all, so let's do just that.

Fizzarolli

May 12

hmm, 3t/s with a partially offloaded (to the point where my vram is full) iq4 8b model... maybe running models locally isn't meant for me lmfao

Lewdiculous

AetherArchitectural org May 12

•

edited May 12

@Fizzarolli - That is rough... Try 2 less layers offloaded. I don't recommend you let the VRAM fill totally, it can clog the buffer and slows things down. Test with leaving around 600mb free.

Maybe that is better, well, maybe not, but worth trying, before giving up until you can upgrade. People do say the 16GB 4060 is good for cheap VRAM and fast inference, and there's also the 12GB 3060 for even less, honestly if I wasn't struggling so much with other things I'd do it now but unfortunately I can't, haha.

Fizzarolli

May 12

oh wow that's actually significantly faster- ~30t/s somehow??

grimjim

May 12

Current Nvidia drivers try to swap VRAM out instead of crashing when memory limits are reached, though this feature can be toggled off.

Lewdiculous

AetherArchitectural org May 12

•

edited May 12

@Fizzarolli Looks pretty usable then, at 30 or even 20T/s as the chat progresses! Enjoy experimenting now :')

saishf

May 14

May not be llama, but it's an 11B model with native 8k.
tiiuae/falcon-11B

saishf

May 14

mustafaaljadery/gemma-2B-10M
For your ultra long context needs?
In less than 32GB of ram 🐥

grimjim

May 18

I'm not done with 7B merges. This model isn't the smartest, but it is less aligned while being capable of varied text generation.
https://huggingface.co/grimjim/madwind-wizard-7B

jeiku

May 18

In the meantime I'm working on training Phi 3 to RP. (protip: don't try to train Phi 3 to RP)

saishf

May 19

•

edited May 19

In the meantime I'm working on training Phi 3 to RP. (protip: don't try to train Phi 3 to RP)

The one thing harder than training llama 3 to rp 😶‍🌫️

Edit - Hopefully Yi-1.5 is easier to train? cognitivecomputations/dolphin-2.9.1-yi-1.5-9b
Community feedback on the 34B is praising, "cognitivecomputations/dolphin-2.9.1-yi-1.5-34b is the most realistic yet! (yes, including the closed models)"

Nitral-AI

May 19

Im preoccupied with xtts-v2 training on the Baldur's gate 3 narrator for memes.

saishf

May 19

Im preoccupied with xtts-v2 training on the Baldur's gate 3 narrator for memes.

I tried messing around with xtts-v2, the pip install command broke somehow.
It doesn't fit in my gpu at the same time as llama3 though😿

saishf

May 19

•

edited May 19

@ABX-AI have to tag you, look how simple & good the reasoning is @_@ (for a little bit)
Dolphin Yi-9B

Then it goes insane :3

Edit - Base Yi-9B-Chat gets it right every time, suspiciously well, like 10 out of 10 times

Lewdiculous

AetherArchitectural org May 19

•

edited May 19

I think I get it...

saishf

May 19

An interesting observation, Yi-9B uses slightly less vram than Llama-3-8B and sustains higher speeds at the same context
Nexesenex comments about it in the lastest fork release

The Yi 34b models race wild under this version, and under 1.66d.

It's the same for 9B.
cognitivecomputations also plans a 32K version (excited😸)

saishf

May 19

It flies at low context 😭

SerialKicked

May 19

•

edited May 19

Welp I was about to ask if someone was planning on toying with Yi 1.5, I'm glad to see you guys are on the ball. Model looks really promising, it's, at the very least, on par with L3.
I assume the speed differential is probably due to the smaller token vocab, less choices to consider.

Nitral-AI

May 19

•

edited May 19

Im preoccupied with xtts-v2 training on the Baldur's gate 3 narrator for memes.

Available here: https://huggingface.co/Nitral-AI/XTTS-V2-BG3NV-FT-ST

Nitral-AI

May 19

Now i crawl back into bed and sleep.

saishf

May 20

I think I get it...

Claude 3 Sonnet can't even get it right, that's why I think it's weird a 9b can get it right 😭

ABX-AI

May 20

@ABX-AI have to tag you, look how simple & good the reasoning is @_@ (for a little bit)
Dolphin Yi-9B

Then it goes insane :3

Edit - Base Yi-9B-Chat gets it right every time, suspiciously well, like 10 out of 10 times

I'm quite happy with the new Hermes Theta, actually. It runs giga-fast in LMS (50t/s at Q5_K_M), and consistently answers this even on regeneration of response.
Answered correctly 7/10 times, which is not bad.

GPT3.5 gets this wrong all the time as well, and I've basically only seen models at the level of GPT4 that get it right every time, anything below is likely to fail at least a few times out of 10.

saishf

May 20

I'm quite happy with the new Hermes Theta, actually. It runs giga-fast in LMS (50t/s at Q5_K_M), and consistently answers this even on regeneration of response.
Answered correctly 7/10 times, which is not bad.

GPT3.5 gets this wrong all the time as well, and I've basically only seen models at the level of GPT4 that get it right every time, anything below is likely to fail at least a few times out of 10.

I've been messing around with Theta too, it's impressive for its size and when I run it through koboldcpp I can see 100t/s most of the time when context is below 4k
I pair it with Maid because it has a native app for android with openai API support.
An interesting new model that popped up when I was messing about in lmsys was glm-4(closed source)
It has flown under the radar (arXiv pages have been appearing since January) but it can answer the weight question right every time and has coding abilities similar to GPT-4

Zhipu AI Unveils GLM-4: A Next-Generation Foundation Model on Par with GPT-4
I'm waiting to see how it scores on the leaderboard.

ABX-AI

May 20

When it's closed source, "on par with gpt-4" is not that interesting at the end of the day, especially now that 4-o is out + free

grimjim

May 20

•

edited May 20

The failed reasoning in my tests with a 7B seem to revolve around determining that steel is denser than feathers, and then halting there rather than chaining in conversions.

I stumbled onto the fact that this model that I released with little notice a couple of months back recently got quanted by two of the current high volume quanters. I have no idea how this happened, but this was a few days after someone came across my post about it and noted that it was a good model? This was a merge where I took a successful merge and then remerged it with a higher benching model, so this appears to support the meta about merging in reasoning, which I will apply to some eventual L3 merges.
https://huggingface.co/grimjim/kunoichi-lemon-royale-v2-32K-7B

I'd been sitting on another 7B merge, and finally got around to releasing it. Starling was never meant to be an RP model, but it seems to have helped in conjunction with Mistral v0.2.
https://huggingface.co/grimjim/cuckoo-starling-32k-7B

Lewdiculous

AetherArchitectural org May 21

•

edited May 21

LLM coping mechanisms - Part 5

Looooong maaaaaan!

Lewdiculous changed discussion status to closed May 21

Lewdiculous unpinned discussion May 21

AetherArchitectural
/

Community-Discussions

Llama 3 coping mechanisms - Part 4