THE THREAD OF DOOM
Just realised I deleted the old "thread of doom" as it was attached to the earliest alpha version of the control vectors :(
Okay, I was wondering if we crossed some sort of line.
Anyway.. the INCREDIBLY important thing I was saying before the thread disappeared was... I have a feeling it is going to be just like they say. They are going to be liberal with grants. I suspect they will target people who are using the space outside the purpose that was intended... somewhere out there, someone has all their RAW 8k videos of their cats...
@ChuckMcSneed @BigHuggyD @gghfez
Ping.
Yeah, it's a pity it got deleted (I should have checked more carefully what was linked), but it was getting a bit out of hand with all that scrolling so perhaps not such a bad thing.
I'm just gonna keep up the models that people have downloaded the most and get rid of all the "experimental, but likely broken" stuff with 15 downloads as they really weren't serving much of a purpose.
Also, all the old versions of the control vectors were vastly inferior to the final version due to me figuring out how to get them working as I went along, so it's probably better to just keep up the final v3.0 ones to avoid a lot of the confusion.
It looks a lot more like I'm just uploading quality models that people like/use now at least... The creative-writer-v0.1-35b and creative-writer-v0.2-35b models will be going as soon as I get the v1.0 version uploaded, and possibly Dusk-Miqu-70B if they do set a hard-limit (I still think Dark-Miqu-70B is worth keeping whatever though).
Also if anybody really misses any I have uploaded, then I can in theory recreate them and upload a LoRA created from the delta using extract_lora.py, but I strongly suspect nobody will even notice most of the models have gone... Of all that I have created I've only ever used Dark-Miqu-70B myself!
:( Damn there was some good info in that thread.
If you've still got Firefox tabs open somewhere, you'll be able to save some of the thread.
Unfortunately, I cleaned my browser tabs up about an hour ago.
And yeah, if people were using it as free cloud storage then it makes sense. I just think they could have gone about it better, rather than having us wake up and see the limit.
I'm curious, did your quota drop after deleting that? I wonder if all the PNG files attached there were "billed" to you.
@jukofyork I think you're good man. If they start enforcing it, you'll get an exemption for sure.
I come across your contributions randomly all over the place, even on github repos like some fine tuning tool lol
I should probably deduplicate my quants. Often, I was making one because I could not find what I was looking for, then it would turn out a few of us just happened to be making them at the same time. Then I started getting requests. So I just decided I would make a bunch. Need a Huggingverse quant global dedupe...
There is a snapshot on the wayback machine:
but it looks like the "click to expand" stuff stopped it getting backed up properly?
The mistralai/Mistral-Large-Instruct-2407 fine-tune is cooking and should be ready in around 9-10 days.
This is going to be good. Mistral-Large is very tolerant of projects like this.
Control-Vector question: how much VRAM is needed to train vectors for Wizard2-8x22b? I vaguely recall in the lost thread you were using 3 x ?
Around 5/8ths of 140GB. I could train everything up to 70B-72B using a single A6000, but the larger models needed 2x A6000.
Thanks. Ended up managing on a single 94GB H100 NVL in the cloud. Looks like it just misses fitting on an 80GB card by less than 1GB of VRAM.
The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation
I'm so confused now... This literally does the exact opposite of everything I thought was the key to making LLMs write better! I wish they had analysed the names like @ChuckMcSneed experiments!?
This seems quite an interesting metric (used in that paper):
From: https://www.sltinfo.com/wp-content/uploads/2014/01/type-token-ratio.pdf
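For anyone who wants to try the metric on their own generations, here is a minimal sketch of a type-token ratio (unique words divided by total words). The regex tokenisation and lower-casing are assumptions for illustration; the paper may tokenise differently.

```python
import re

def type_token_ratio(text: str) -> float:
    """Unique word count divided by total word count (0.0 for empty text)."""
    # Assumed tokenisation: lower-cased runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# "the" appears twice, so 5 unique words out of 6 in total.
ratio = type_token_ratio("the cat sat on the mat")
```

A lower ratio means more repeated words within that single text, which is all this metric can see.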
They tested repetition within text, but not between different texts generated by the same model. The modern problem of repetition is not that it keeps writing the same slop in one gen, the problem is that when you run multiple gens, you'll get the same fucking slop.
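A rough way to measure repetition between generations rather than within one is to count how many distinct n-grams turn up in more than one generation. This is only a sketch: the whitespace tokenisation, the trigram default and the function names are assumptions, not anything taken from the paper.

```python
from collections import Counter

def ngrams(text, n):
    """Set of word n-grams in one text (whitespace tokenisation, lower-cased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_ngram_fraction(generations, n=3):
    """Fraction of distinct n-grams that occur in at least two generations."""
    doc_freq = Counter()
    for text in generations:
        # ngrams() returns a set, so each generation counts an n-gram once.
        doc_freq.update(ngrams(text, n))
    if not doc_freq:
        return 0.0
    shared = sum(1 for count in doc_freq.values() if count >= 2)
    return shared / len(doc_freq)

gens = [
    "the wind whispered through the woods",
    "the wind whispered across the hills",
]
overlap = shared_ngram_fraction(gens, n=3)
```

A model can score well on within-text metrics and still score badly here if every run opens with the same phrases.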
Yeah, I've been thinking about this too and wonder if a really well curated dataset of "openings" (sentences, paragraphs, chapters, etc) of books/stories might help somewhat with this?
Just checked on the mistral-large fine-tune and it's nearly 1/2 way now and still looking good: at 60% of the way it will switch to a cosine schedule, so fingers crossed it stays this way:
I was a little worried when I saw those big jumps in the max-norm, but it's probably just due to the weird / non-standard hyper-parameters I have to use to increase the Entropy (ie: it can't use any momentum-based optimiser or it overshoots badly, so have to use Adam with beta1 = 0; aka uncentered-RMSprop).
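To make the "Adam with beta1 = 0" remark concrete, here is a scalar sketch showing that with beta1 = 0 the first-moment estimate is just the raw gradient, so the update reduces to an RMSprop-style step (with Adam's bias correction kept on the second moment). The function names and the toy gradients are made up for illustration; this is not the optimiser code used for the fine-tune.

```python
import math

def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update with bias correction."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

def rmsprop_like_step(param, grad, state, lr=1e-3, beta2=0.999, eps=1e-8):
    """Uncentered RMSprop-style update (no momentum), same bias correction."""
    state["t"] += 1
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * grad / (math.sqrt(v_hat) + eps)

def run(step_fn, grads, **kwargs):
    """Apply step_fn over a list of gradients, starting from param = 1.0."""
    param, state = 1.0, {"t": 0, "m": 0.0, "v": 0.0}
    for grad in grads:
        param = step_fn(param, grad, state, **kwargs)
    return param

toy_grads = [0.5, -1.0, 2.0, 0.1]  # made-up gradients
adam_result = run(adam_step, toy_grads, beta1=0.0)
rmsprop_result = run(rmsprop_like_step, toy_grads)
```

Both runs end at the same parameter value, because with beta1 = 0 the m term carries no history between steps.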
From previous experiments, the Entropy should start to drop slightly now and hopefully end up being approximately the same as the log-loss by the end of training...
Considering I've optimised the hyper-parameters on command-r:35b; this looks pretty hopeful the same will work for all models.
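The "constant, then cosine from 60% onwards" schedule mentioned above could look something like the sketch below. The function name, the clamping of progress to 1.0 and the absence of warmup are all assumptions; the actual training code may implement its schedule differently.

```python
import math

def lr_at_step(step, total_steps, base_lr, switch_fraction=0.6):
    """Constant learning rate until switch_fraction of training, then cosine decay to 0."""
    switch_step = round(switch_fraction * total_steps)
    if step < switch_step:
        return base_lr
    decay_steps = max(1, total_steps - switch_step)
    # Clamp so that any step past the end stays at the final value.
    progress = min(1.0, (step - switch_step) / decay_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Sample the schedule at a few points of a hypothetical 1000-step run.
schedule = [lr_at_step(s, 1000, 2e-5) for s in (0, 600, 800, 1000)]
```

The clamp is a deliberate choice in this sketch: without it, a cosine evaluated past the final step starts to rise again.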
They tested repetition within text, but not between different texts generated by the same model. The modern problem of repetition is not that it keeps writing the same slop in one gen, the problem is that when you run multiple gens, you'll get the same fucking slop.
I no longer think this is a solvable problem with these models. Ultimately; once trained, they are stateless and have no concept of how often they've produced the same slop over the past 100,000+ inference sessions.
Even if we get the entropy higher while maintaining coherence; at scale, we'll still see new slop patterns emerging.
Even with all the GPT-isms nuked, we'll end up with new -isms. The "Whispering Woods" will become something else. Almost like you need a new model per book, or maybe a bunch of LoRA applied at different scales.
I tested this briefly by corrupting the mlp modules of a model so that it produced weird names for characters, objects (and for some reason, it also caused a temporal displacement in its general knowledge) then generated a small 3k dataset with the story prompts in the control vectors git repo. I then had gemma-2-2b read them all and list all character names. Ended up with new Elaras like "Vi'tkol" showing up in half the stories lol
Or we all need our own private tunes with jukofyork's bespoke technique ^
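The counting step of that experiment is easy to reproduce once the names have been extracted per story. Here is a minimal sketch that reports the share of stories each name appears in; the extraction itself (gemma-2-2b in the post) is outside the sketch, and every name in the sample apart from "Vi'tkol" is made up for illustration.

```python
from collections import Counter

def name_story_share(names_per_story):
    """Map each name to the fraction of stories it appears in, most common first."""
    doc_freq = Counter()
    for names in names_per_story:
        # set() so a name repeated inside one story only counts once.
        doc_freq.update(set(names))
    total = len(names_per_story)
    return {name: count / total for name, count in doc_freq.most_common()}

# Toy input: one list of extracted names per generated story.
extracted = [
    ["Vi'tkol", "Mara"],
    ["Vi'tkol"],
    ["Joren", "Vi'tkol", "Vi'tkol"],
    ["Mara"],
]
shares = name_story_share(extracted)
```

Any name with a share close to 0.5 or above is a good candidate for being the model's new "Elara".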
I think some of this is likely a failure of the associative memory again:
I've been thinking a lot about QwQ and I'm beginning to think the "power" of the model actually comes from being able to approximate higher-order interaction effects from the words it writes.
The associative memory in the transformer architecture (and the Hopfield networks that came before) only really looks at second-order interactions (directly).
Trying to extend the transformer architecture to cubic interactions (and beyond) is totally out of the question as second-order interaction already cost O(n^2).
You can actually approximate higher order interactions to some degree, eg:
https://en.m.wikipedia.org/wiki/Bayesian_network
But it quickly blows up...
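To put rough numbers on "quickly blows up": counting the unordered groups of k tokens in a context of n tokens gives a feel for how pairwise, cubic and higher-order interactions scale. This is only a counting proxy, not a cost model of any real architecture, and the function name is made up.

```python
import math

def interaction_counts(n_tokens, max_order):
    """Number of distinct token groups of each size from 2 up to max_order."""
    return {order: math.comb(n_tokens, order) for order in range(2, max_order + 1)}

# Pairs, triples and quadruples for a hypothetical 8192-token context.
counts = interaction_counts(8192, 4)
```

Each extra order multiplies the count by roughly n / k, which is why going beyond pairwise terms directly is not practical at long context lengths.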
So what I think QwQ might be doing is trawling through all the "linked associations" which in turn let it look "further" away from the input context than repeated transformer blocks allow (which can likely only consider a very constrained set of links; likely following a very restrictive pattern too).
So how is this related to creative writing?
Well at the start, the model only really has what you have given it in the prompt to go off, so will likely only have this along with some kind of low-Entropy / pre-baked "template" story (that shows up again and again and again...).
One solution then would be to try to preload the KV-cache with some sort of jumbled up "superimposition" of story prompts, to try to kick-start it away from the boring "template", but I think this will likely be fraught with the model not following your instructions and other "weird shit" due to the randomised input possibly having nothing to do with what you actually want.
So what's the (an) alternative?
Try to start by "asking around" but be very careful to not give away what you actually want to do, eg:
- What do you know about Joe Abercrombie?
- What do you know about Rob J Hayes?
- What do you know about Grimdark fantasy and how is it different to epic fantasy?
- Let's think about some original settings and character names that might work in a setting like this.
- Let's now summarise what we have thought about so far.
- What are we missing here? Can you list some related stuff to consider that we haven't discussed yet?
and so on..
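The list above can be scripted as a multi-turn session in which the real task only appears as the last user turn. In this sketch `chat` is a hypothetical callable (message list in, reply string out) standing in for whatever backend you use, and the final task string is made up.

```python
def run_primed_session(priming_questions, final_task, chat):
    """Ask the priming questions one turn at a time, then give the real task last."""
    messages = []
    for prompt in list(priming_questions) + [final_task]:
        messages.append({"role": "user", "content": prompt})
        reply = chat(messages)  # hypothetical backend call
        messages.append({"role": "assistant", "content": reply})
    return messages

priming = [
    "What do you know about Joe Abercrombie?",
    "What do you know about Rob J Hayes?",
    "What do you know about Grimdark fantasy and how is it different to epic fantasy?",
    "Let's think about some original settings and character names that might work in a setting like this.",
    "Let's now summarise what we have thought about so far.",
    "What are we missing here? Can you list some related stuff to consider that we haven't discussed yet?",
]

def echo_chat(messages):
    """Stand-in backend so the sketch runs without a model."""
    return "(reply to: " + messages[-1]["content"] + ")"

history = run_primed_session(priming, "Now write the opening chapter.", echo_chat)
```

Keeping the task out of every earlier turn is the whole point, so the helper takes it as a separate argument rather than as part of the question list.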
This is exactly what QwQ is doing, but then it finishes off by writing a heap of the worst qwen-slop imaginable! :D
We need to find a way to "pre-load" this higher-order, possibly useful, possibly useless, context into some of the better models.
This method actually has a name in psychology / educational theory, but I've forgotten what it is called now:
Basically the idea is to "prime" the student with something novel/interesting that gets these sort of associations working and creates "anticipation", before actually giving the task...
IIRC, it has "prime" in the name.
I have done something similar to that before back when GPT3.5 came out.
I wrote a bunch of phrases at the start, then said "Oh sorry, wrong window, what I meant to say was: "
This is exactly what QwQ is doing
I hadn't realized that, but that makes perfect sense.
be very careful to not give away what you actually want to do
Why is that?
It's a bit like the "don't think of an elephant" thing: if I start off telling you that we're ultimately gonna be writing "a Grimdark story in the style of..." then all the distant associations you know about are unlikely to be used effectively as you've "framed" the problem for them.
From a human perspective, I think it also likely triggers the "reward centres" more due to a mix of "anticipation" and the "satisfaction" of problem solving.
I don't know anything about psychology (at all) so may be using the wrong terminology; it's just 20+ years ago I worked as a private maths teacher who had to deal with kids excluded from school and often those who had failed to get anywhere with other private teachers too! Needless to say; I read a lot about educational theory those years and even managed to get some to pass their exams that nobody would have thought possible... :/
https://en.m.wikipedia.org/wiki/Priming_(psychology)
I think it is actually just called "priming" but sadly wokeism seems to have corrupted the Wikipedia article:
Priming is thought to play a large part in the systems of stereotyping.
https://www.teachwithmrst.com/post/priming
this is another example of priming, which is an increased sensitivity to a particular schema due to a recent experience. In other words, priming is when an experience or exposure to a stimulus puts a particular schema at the forefront of our mind. When this in turn influences our judgments and decisions, it's called the priming effect.
Have you tried it with base models? Take the good old llama1 or falcon-180b and see if it makes slop or not. The problem is instruction tuning.
Interestingly, this paper (which sadly got lost when I deleted the old thread :/) shows that base models start off well:
https://openreview.net/forum?id=ZpQ2SqQNXf
but then start to gain way too much entropy as the sequence length increases:
It almost looks like if we could do "late fusion" on the two sets of outputs we would have something close to human generation?!
When my machines finally finish training, then I think I might be able to hack together something that tests this...
I think it will need some heuristics adding to let the instruct model decide when to stop, but otherwise it's just a case of blending the probability outputs before deciding which token to accept.
(I've already experimented lots with merging base/instruct models and/or making MoE models with the gating weights all set to zero, and both are "interesting" but sadly never stop and quickly go completely off the rails by talking to themselves, etc).
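A toy version of the "late fusion" idea, blending the two next-token distributions before sampling and letting the instruct model decide when to stop, might look like this. Everything here is an assumption for illustration: the dict-based distributions, the linear mix, the `<eos>` token string and the threshold rule are not from any existing implementation.

```python
import random

def blend_distributions(base_probs, instruct_probs, base_weight=0.5):
    """Linear mix of two token->probability dicts, renormalised to sum to 1."""
    tokens = set(base_probs) | set(instruct_probs)
    mixed = {
        tok: base_weight * base_probs.get(tok, 0.0)
        + (1.0 - base_weight) * instruct_probs.get(tok, 0.0)
        for tok in tokens
    }
    total = sum(mixed.values())
    return {tok: p / total for tok, p in mixed.items()}

def pick_next_token(base_probs, instruct_probs, eos="<eos>",
                    eos_threshold=0.5, base_weight=0.5, rng=None):
    """Stop if the instruct model strongly wants to; otherwise sample from the blend."""
    rng = rng or random
    if instruct_probs.get(eos, 0.0) >= eos_threshold:
        return eos
    mixed = blend_distributions(base_probs, instruct_probs, base_weight)
    tokens = sorted(mixed)
    return rng.choices(tokens, weights=[mixed[t] for t in tokens], k=1)[0]

# Toy distributions over a three-token vocabulary.
base = {"storm": 0.5, "rain": 0.5}
instruct = {"storm": 0.9, "<eos>": 0.1}
token = pick_next_token(base, instruct, rng=random.Random(0))
```

The stop rule only looks at the instruct model because, as noted above, base models on their own never decide to stop.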
Interestingly, this paper (which sadly got lost when I deleted the old thread :/) shows that base models start off well:
You've still got it though right (you linked to it).
I've got a copy which I used to build a tool to replicate the graphs in the paper.
Have you tried it with base models?
Not really, even with few-shot prompting, couldn't get them to reliably produce synthetic data.
Take the good old llama1 or falcon-180b and see if makes slop or not. The problem is instruction tuning.
Okay that was a beast to get running. It doesn't seem to produce gpt-isms, but I notice it re-uses the same names a lot (not Elara but its own names).
That's what I mean, I think all of these models; once they've been (pre)trained and become stateless weights, will either have their own flavor of slop, or produce noise. Kind of like how we have our own patterns of speech, etc.
P.S. I see they've given us more storage now on HF, and it looks like public repos are free
So I've been reading up on the "Softmax Bottleneck":
https://arxiv.org/abs/1711.03953
which likely affects all LLMs to some degree (due to having n_vocab >> hidden_dim), but likely affects small LLMs the most:
https://arxiv.org/abs/2404.07647
(possibly one of the reasons Cohere and Mistral-large with their 12k hidden_dim outperform the standard 8k hidden_dim of the 70B models for writing too?)
The "Mixture of Softmax" solution isn't very appealing as the lm_head tensors are already huge...
Another solution people have experimented with is passing the logits through a non-linear function:
https://arxiv.org/abs/1805.10829
https://arxiv.org/abs/1902.08077
Then it occurred to me that we already have an example of a range of models that do this, which are also quite good at creative writing and appear to "punch above their weight" - gemma2 with their "logit soft capping":
https://arxiv.org/abs/2408.00118
which originally came from this paper:
https://arxiv.org/abs/1611.09940 (Section 5.1, 'RL pretraining')
Interestingly, the "Sigsoftmax" paper above experimented with using the binary sigmoid function:
and found it worked better than their function (which is a sort of "soft leaky RELU") for one of the tests, but concluded capping at 1 was likely problematic...
But the gemma2 models use +/- 30 for their cap:
"final_logit_softcapping": 30.0,
which when passed through exp(), is well outside the range of floating point values anyway...
So I wonder if the benefit of gemma2's "final logit softcapping" is actually nothing to do with clipping/capping; and simply because it solves the "Softmax Bottleneck" problem to some degree due to the non-linearity it introduces?!
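For reference, the soft capping described in the Gemma 2 report is cap * tanh(logit / cap), applied to the logits before the softmax. A minimal scalar sketch of that function follows; the helper names and sample logits are made up, and the plain softmax is only there to show where the capped values would go.

```python
import math

def soft_cap(logit, cap=30.0):
    """Smoothly squash a logit into (-cap, cap): near-identity for small values."""
    return cap * math.tanh(logit / cap)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

raw_logits = [1.0, 10.0, 30.0, 60.0]  # made-up values
capped = [soft_cap(x) for x in raw_logits]
probs = softmax(capped)
```

Small logits pass through almost unchanged while large ones are compressed, so the mapping from hidden state to log-probabilities is no longer purely linear, which is the property the speculation above relies on.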
P.S. I see they've given us more storage now on HF, and it looks like public repos are free
Yeah, I saw that posted on Reddit too. I'm 1 day away from the mistral-large fine tune being ready:
So at least I won't have to delete anything to upload it (I am gonna clear out the 2 remaining 35B "experimental" models when it gets uploaded though).
Pretty excited to see what it is like as 9 days has felt like a long time lol.
I've decided the next will be command-r-plus:104b (old version) and then after that qwen-1.5:110b.
I can't see any compelling reason to run on the new version of command-r-plus:104b or mistral-large:123b as for creative writing; they both seem like a downgrade...
Enough slop in the new releases to keep the pigs happy...
Yeah, and I think some of the newer models are starting to filter out copyrighted data so they aren't gonna work well even if the slop can be reduced :/
I think qwen-1.5:110b is worth trying, as even though it was in the v1.5 line it came out way after the others, and does seem to not have been "benchmaxxed" as badly as the v2.0 and v2.5 models.
The older v1.5 models also didn't have great long context ability:
https://github.com/NVIDIA/RULER
but I have a feeling qwen-1.5:110b was actually more like qwen-2:110b but just named as v1.5...
Before all the gemma2:9b clones took over, it scored fairly high on EQ-Bench:
http://eqbench.com/creative_writing.html
and did appear to do well in the sample "write in the style of" prompts they used to test it (meaning it's unlikely to have had the copyrighted data filtered out).
It also appears to be quite intelligent and actually scored higher than the commercial models when acting as a judge in this paper:
https://arxiv.org/abs/2406.08598v2
I think it will be interesting to see how it turns out anyway.
This paper makes me think merging might be back on the cards too:
https://arxiv.org/abs/2412.06769
and I noticed all the top places in the open-llm-leaderboard:
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
appear to be using versions of qwen2:72b and qwen2.5:72b with around 6 layers self-merged (the authors are very cagey about saying exactly what the method is though...).
I wonder if command-r-plus with the middle 16 (or 24) layers duplicated (taking it up to 80 or 88 layers respectively), might be a worthwhile experiment?
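The layer arithmetic can be checked with a small sketch that builds the layer order for a self-merge where the middle block is repeated once. Repeating it as one contiguous block is only one possible layout and is an assumption here; the function name is made up.

```python
def self_merge_layer_order(n_layers, n_duplicated):
    """Layer indices for a model whose middle n_duplicated layers appear twice."""
    start = (n_layers - n_duplicated) // 2
    end = start + n_duplicated
    # Run layers 0..end-1, then go back and run start..n_layers-1.
    return list(range(0, end)) + list(range(start, n_layers))

order_16 = self_merge_layer_order(64, 16)  # middle block is layers 24..39
order_24 = self_merge_layer_order(64, 24)  # middle block is layers 20..43
```

For a 64-layer model this gives 80 and 88 entries respectively, matching the counts in the post.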
I'm pretty sure the "multiplicative-LoRA" method is ideally suited to fixing a lot of the old weirdness caused by merging, and these middle layers are clearly related to concepts as they were the most important for the control vectors...
The discussion in this thread:
https://huggingface.co/MaziyarPanahi/calme-2.4-rys-78b/discussions/10
Is what makes me believe the "secret sauce" is really just a self-merge...
I also confirmed myself that the miqu:120b self-merge, although slightly broken; was more capable of solving puzzles...
If we can make command-r-plus just a little smarter, then it would be a big win IMO and only take the size up to around the same as mistral-large:123b and still less than wizard-lm-2:140b.
IIRC, @llmixer did some experiments and found deeper models generally wrote better (and he wasn't keen on command-r-plus:104b due to it only having 64 layers compared to the more standard 80 layers of the 70b models? Apologies if it wasn't you!).
@TheDrummer tried making largestral smaller by cutting out "unimportant layers", but it didn't go too well imo. While the vanilla knew all 8 of the styles, the cut down version almost completely forgot one and got worse at writing poems:
I for one would love smarter command r plus. Still one of my favorite writers but its continuity leaves something to be desired
I've decided the next will be command-r-plus:104b (old version) and then after that qwen-1.5:110b.
I can't see any compelling reason to run on the new version of command-r-plus:104b or mistral-large:123b as for creative writing; they both seem like a downgrade...
Enough slop in the new releases to keep the pigs happy...
Even pigs aren't happy with the new one.
Because it's worse for non-creative tasks. Its general knowledge is worse than 2407 (same as command-r-plus-08) even though 2411 appears to have the same knowledge cutoff as 2407.
I'm not sure they're trying to remove copyright though, I suspect it's teething issues, the first time Mistral have tried adding a proper system prompt to their template.
I also confirmed myself that the miqu:120b self-merge, although slightly broken; was more capable of solving puzzles...
Was this the one which had random spelling/grammatical errors? I wonder if that could be healed with a very light finetune. I've successfully taught a model I broke how to speak again with a quick r=16,a=32 tune on the mlp modules, using a dataset generated by the original model.
Is what makes me believe the "secret sauce" is really just a self-merge...
Could Vizdiff help you investigate this? https://huggingface.co/spaces/Steelskull/Vis_Diff
I am gonna clear out the 2 remaining 35B "experimental" models when it gets uploaded though
If you just want to tidy up, sure. But public models don't count towards the quota.
Took a snapshot of https://archive.is/M8Tr2 to avoid link-rot.
@jukofyork P.S. since llama.cpp server has on the fly lora swapping and scaling (like the control-vector-scaled) with the latest version, and Mistral-Large is huge to store locally, I don't suppose you could upload the LoRA adapter of your Mistral-Large as well like rAIfle did with rAIfle/SorcererLM-8x22b-epoch2-LoRA ?
Is what makes me believe the "secret sauce" is really just a self-merge...
Could Vizdiff help you investigate this? https://huggingface.co/spaces/Steelskull/Vis_Diff
Thanks, I'll have a look at this and see if I can spot what they did.
I am gonna clear out the 2 remaining 35B "experimental" models when it gets uploaded though
If you just want to tidy up, sure. But public models don't count towards the quota.
Yeah, I'm just trying to avoid a lot of the confusion and only have "good" models uploaded.
Took a snapshot of https://archive.is/M8Tr2 to avoid link-rot.
@jukofyork P.S. since llama.cpp server has on the fly lora swapping and scaling (like the control-vector-scaled) with the latest version, and Mistral-Large is huge to store locally, I don't suppose you could upload the LoRA adapter of your Mistral-Large as well like rAIfle did with rAIfle/SorcererLM-8x22b-epoch2-LoRA ?
The problem is that it's a Multiplicative-LoRA so the standard Additive-LoRA code won't work, and even a very high rank SVD still can't capture the full Multiplicative-LoRA :/
I could possibly save just the down_proj tensors using the modules_to_save option, but sadly it won't work with most stuff and I am probably best just uploading the full model.
@TheDrummer Will do.
@ChuckMcSneed Could you check out Endurance v1 & v1.1 to see if finetuning healed it to an extent?
Great, who left the door open again?!?! ;D
The problem is that it's a Multiplicative-LoRA so the standard Additive-LoRA code won't work, and even a very high rank SVD still can't capture the full Multiplicative-LoRA :/
All good, this is a special case then. I've cleared up space by deleting the new Mistral-Large and command-r+, other models I don't need.
Looking forward to trying it out!
Bad news guys :(
It seems to have corrupted itself and tried to do an extra step (???) at the end:
GPU-SERVER-1: before GAS splitting, batch size: 10, total tokens: 81920
GPU-SERVER-1: [2024-12-12 14:38:52,276] [INFO] [logging.py:129:log_dist] [Rank 0] step=1159, skipped=0, lr=[0.0], mom=[0.0]
GPU-SERVER-1: [2024-12-12 14:38:52.456] [INFO] [qlora-pipe] step: 1159 / 1159 loss: 1.5680 iter time (s): 622.448 samples/sec: 0.048 eta: 46m41s
GPU-SERVER-1: before GAS splitting, batch size: 10, total tokens: 81920
GPU-SERVER-1: [2024-12-12 14:49:11,957] [INFO] [logging.py:129:log_dist] [Rank 0] step=1160, skipped=0, lr=[1.1460462221279944e-09], mom=[0.0]
GPU-SERVER-1: [2024-12-12 14:49:12.019] [INFO] [qlora-pipe] step: 1160 / 1159 loss: 8.7767 iter time (s): 618.958 samples/sec: 0.048 eta: 36m18s
and then crashed....
I tried quantizing this and can confirm it's completely broken (as the loss: 8.7767 indicates).
Even worse is I tried to go back to the step: 1100 snapshot and it turns out two of the ranks have been saving 2 copies (???) at the same time:
GPU-SERVER-1: [2024-12-12 04:26:28,592] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step1100 is ready now!
GPU-SERVER-2: [2024-12-12 04:26:28,598] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_46-model_states.pt...
GPU-SERVER-1: [2024-12-12 04:26:28,602] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_02-model_states.pt...
GPU-SERVER-1: [2024-12-12 04:26:28,841] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_02-model_states.pt.
GPU-SERVER-1: [2024-12-12 04:26:28,854] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_03-model_states.pt...
GPU-SERVER-2: [2024-12-12 04:26:28,869] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_46-model_states.pt.
GPU-SERVER-2: [2024-12-12 04:26:28,881] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_47-model_states.pt...
GPU-SERVER-1: [2024-12-12 04:26:29,083] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_03-model_states.pt.
so these all seem messed up too :(
I will have to power-cycle all the machines and/or try to investigate what caused this when I get back home, but not much point in redoing it or trying other models until then.
Looks possibly like GPU-SERVER-2 has a broken SSD :(
Meh.
Shit.
Before you reboot:
1160 / 1159 loss: 8.7767
step: 1100
Do you have a much earlier step like 500? If this sync issue is somehow related to the dead SSD it might have been okay earlier on so it's not all lost at least
broken SSD
Again, before you reboot, it's worth asking Claude/o1 if there's a way to get the data. Years ago I nuked an SSD and I forget what I did, but managed to get something back which was still loaded. Depends on the filesystem though (claude/o1 would know)
I don't suppose you had something like wandb logging your checkpoints?
I think the SSD errors were a red herring and there actually was something wrong with mixing pipeline parallel and batch parallel at the same time.
It seems both rank 0 and rank 1 had been saving over the top of each other the whole run and I never noticed :/
I'm just gonna run on the 30B-ish models which don't use pipeline parallel whilst away and see how they get on... If they are fucked too then something more serious must have gone wrong as I did manage to train endless command-r:35b fine tunes before.
I've also reverted a lot of the fiddling about I did and made a fresh pull of qlora-pipe just in case...
If I can't mix pipeline parallel and batch parallel then it's not the end of the world, as I can just run the training 3x and combine all the LoRA using the mean or even SVD (but sadly 9 days --> 27 days).
This might even be the better option as the ratio of samples to tunable parameters for the large models is gonna be pretty bad anyway and this would help with overfitting.
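As a rough illustration of the "combine using the mean" option: the mean of k LoRA deltas B_i A_i is itself exactly a LoRA of rank k*r, built by placing the B factors side by side (scaled by 1/k) and stacking the A factors. The sketch below uses toy nested-list matrices and made-up helper names (`matmul`, `mean_of_deltas`, `concat_loras`); none of this is qlora-pipe code. Getting back down to the original rank would need a truncated SVD of the averaged delta, which is not shown here.

```python
def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def mean_of_deltas(loras):
    """Average the full deltas B @ A of several (B, A) LoRA pairs."""
    deltas = [matmul(b, a) for b, a in loras]
    k = len(deltas)
    rows, cols = len(deltas[0]), len(deltas[0][0])
    return [[sum(d[i][j] for d in deltas) / k for j in range(cols)] for i in range(rows)]


def concat_loras(loras):
    """Build one (B, A) pair of rank k*r whose product equals the mean delta."""
    k = len(loras)
    n_rows = len(loras[0][0])
    # B_cat = [B_1 | B_2 | ... | B_k] / k  (columns placed side by side)
    b_cat = [[v / k for b, _ in loras for v in b[i]] for i in range(n_rows)]
    # A_cat stacks the rows of A_1, A_2, ..., A_k
    a_cat = [row for _, a in loras for row in a]
    return b_cat, a_cat


# Three toy rank-1 LoRAs for a 2x3 weight matrix (hypothetical values).
loras = [
    ([[1.0], [2.0]], [[1.0, 0.0, 1.0]]),
    ([[0.0], [3.0]], [[2.0, 1.0, 0.0]]),
    ([[4.0], [1.0]], [[0.0, 1.0, 1.0]]),
]
b_cat, a_cat = concat_loras(loras)
merged = matmul(b_cat, a_cat)      # product of the concatenated factors
expected = mean_of_deltas(loras)   # mean of the three separate deltas
```

The two results agree up to floating-point error, so averaging costs nothing in accuracy; the only price is the higher rank of the merged adapter.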
Oof sorry 😞
So I've been hunting through the qlora-pipe code to see if I could see where the "extra step" came from (which I think actually ended up with a negative learning rate and hence performed gradient ascent and ruined the model at the end). I didn't manage to find the answer, but I have found a way better method to create the training data, eg:
- Extract all paragraphs that are between 200 and 2000 characters (which is ~40-400 words or ~50-500 tokens). This gets rid of all the "dross" like tables of contents, page numbers, etc and leaves just nice clean paragraphs.
- So now we're left with ~1.1M paragraphs and for each of these, we trim any trailing whitespace and add two new lines (to be consistent with how most LLMs output paragraphs) and then append an <EOS> token.
- Randomly shuffle all the 1.1M paragraph + "\n\n" + <EOS> chunks and concatenate them to use as training data.
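The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the scripts actually used: the function name `build_training_text`, the blank-line paragraph split and the literal `"<EOS>"` placeholder string are all assumptions.

```python
import random
import re


def build_training_text(raw_text, eos="<EOS>", min_chars=200, max_chars=2000, seed=0):
    """Filter paragraphs by length, normalise their endings, shuffle and concatenate."""
    # Assumption: one or more blank lines mark a paragraph break.
    paragraphs = re.split(r"\n\s*\n", raw_text)
    kept = []
    for p in paragraphs:
        p = p.rstrip()  # trim trailing whitespace only
        if min_chars <= len(p) <= max_chars:
            kept.append(p + "\n\n" + eos)
    random.Random(seed).shuffle(kept)
    return "".join(kept)


# Tiny synthetic input: a page number, a good paragraph, a run-on, another good one.
sample = "Page 12\n\n" + "A" * 250 + "   \n\n" + "B" * 3000 + "\n\n" + "C" * 300
text = build_training_text(sample)
```

The short "Page 12" line and the 3000-character run-on are both dropped, and every surviving paragraph ends in exactly two newlines followed by the end marker.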
For example, for Cohere models:
Not unkindly, Mr. Nell told him, "There's two parts to the system. One part carries solid human waste--shit, if I'd not be offendin yer tender ears. The other part carries gray water--water flushed from toilets or run down the drains from sinks and washin-machines and showers; it's also the water that runs down the gutters into the city drains.
<|END_OF_TURN_TOKEN|>The aluminum sled on which Norah was transporting her testing gear resembled an oversized Flexible Flyer. The craft was prepacked with diagnostic gear and safety accessories she'd been using on the glacier over the past few days. All of her gear--including a battery pack, safety flares, and a powerful front-mounted spotlight--was bound under a secured, plastic tarp. Despite the heavy load, the sled glided effortlessly on long, straight runners. Even on the almost imperceptible incline, the sled moved downhill on its own accord, and Norah applied a gentle restraint, almost as if allowing the sled to lead the way. Sensing the distance growing between the group and the habisphere, Tolland looked over his shoulder. Only fifty yards away, the pale curvature of the dome had all but disappeared in the blustery blackness.
<|END_OF_TURN_TOKEN|>He packed a compartmentalized, hand-tooled Mark Cross briefcase with the blue bag, the Green Acres bag, and the tape recorder that he used for dictation. While he waited for the Keanuphobe to call, he would do some game planning and compose a chapter *of Fear Not for l Am with You.*
<|END_OF_TURN_TOKEN|>Well, the word was out. Cancer. Rhymes with *dancer* and You *just shit your pants, sir.* God knew the word had bobbed up in his own mind more than once since getting on the penny scale in front of the shoe store. It had bobbed up like some evil clown's dirty balloon and he had turned away from it. He had turned away from it the way you turned away from the bag ladies who sat rocking back and forth in their strange, sooty little nooks outside the Grand Central Station or the way you turned away from the capering Gypsy children who had come with the rest of the Gypsy band. The Gypsy children sang in voices that somehow managed to be both monotonous and strangely sweet at the same time. The Gypsy children walked on their hands with tambourines outstretched, held somehow by their bare dirty toes. The Gypsy children juggled. The Gypsy children put the local Frisbee jocks to shame by spinning two, sometimes three of the plastic disks at the same time - on fingers, on thumbs, sometimes on noses. They laughed while they did all those things, and they all seemed to have skin diseases or crossed eyes or harelips. When you suddenly found such a weird combination of agility and ugliness thrust in front of you, what else was there to do but turn away? Bag ladies, Gypsy children, and cancer. Even the skittery run of his thoughts frightened him.
<|END_OF_TURN_TOKEN|>
(sadly all the work of extracting, shuffling, formatting, etc is done using bash scripts as python was so slow it kept timing out the Deepspeed connection...)
We then load in the new dataset files and create batches using this modified version of yield_sequences_from_token_batch:
from tqdm import tqdm

def yield_sequences_from_token_batch(tokenizer, token_batch, sequence_len):
    """Yields fixed-length sequences from batches of tokens, ensuring proper BOS/EOS token handling.
    Takes batches of tokens and yields sequences of fixed length, with each sequence:
    - Starting with BOS token if specified in tokeniser
    - Containing complete chunks terminated by EOS tokens (never splitting between EOS tokens)
    - Right-padded with extra EOS tokens if needed so all reach exactly sequence_len
    """
    sequence_tokens = [] if tokenizer.bos_token_id is None else [tokenizer.bos_token_id]
    for tokens in tqdm(token_batch):
        tokens = tokens.tolist()
        assert len(tokens) > 0, "empty token list"
        assert tokens[-1] == tokenizer.eos_token_id, "token lists must end with EOS"
        idx = 0
        # If present, skip the auto-generated BOS token
        if tokenizer.bos_token_id is not None and tokens[0] == tokenizer.bos_token_id:
            idx += 1
        while idx < len(tokens):
            # Each chunk runs up to and including the next EOS token
            next_eos_idx = tokens.index(tokenizer.eos_token_id, idx)
            chunk = tokens[idx:next_eos_idx + 1]
            assert len(chunk) <= sequence_len, "chunk exceeds sequence length"
            # If the chunk won't fit, pad out the current sequence with EOS and start a new one
            if len(sequence_tokens) + len(chunk) > sequence_len:
                sequence_tokens.extend([tokenizer.eos_token_id] * (sequence_len - len(sequence_tokens)))
                yield sequence_tokens
                sequence_tokens = [] if tokenizer.bos_token_id is None else [tokenizer.bos_token_id]
            sequence_tokens.extend(chunk)
            idx += len(chunk)
    # Only keep the final partial sequence if it is at least half full
    if len(sequence_tokens) >= sequence_len / 2:
        sequence_tokens.extend([tokenizer.eos_token_id] * (sequence_len - len(sequence_tokens)))
        yield sequence_tokens
Which then gets called like this:
dataset = dataset.map(lambda x: tokenizer(x['text']), batched=True, batch_size=10, remove_columns=dataset.column_names, desc='tokenizing', num_proc=num_proc)
dataset = dataset.map(lambda x: {'input_ids': list(yield_sequences_from_token_batch(tokenizer, x['input_ids'], sequence_len))}, batched=True, batch_size=None, remove_columns=dataset.column_names, desc='splitting')
# Set labels for EOS tokens to -100 to exclude them from training gradient calculations
dataset = dataset.map(
    lambda x: {
        'attention_mask': torch.ones_like(x['input_ids']),
        'labels': torch.where(x['input_ids'] == tokenizer.eos_token_id, torch.full_like(x['input_ids'], -100), x['input_ids'])
    },
    desc='adding attention_mask and labels (with EOS labels set to -100)'
)
to ensure the <EOS> tokens are attended to, but not used for gradient calculations (which would bias the response lengths of the fine-tuned model).
This also means I can right-pad all the batches up to the desired sequence length using <EOS> tokens.
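To make the effect of the -100 labels concrete, here is a small pure-Python sketch of a cross-entropy that skips ignored positions, which is what PyTorch's cross-entropy does by default for the label value -100. The helper names and the toy logits are invented for illustration, and the usual one-position shift between inputs and labels is left out to keep it short.

```python
import math

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default


def mask_eos_labels(input_ids, eos_id):
    """Copy input_ids to labels, replacing every EOS with IGNORE_INDEX."""
    return [IGNORE_INDEX if t == eos_id else t for t in input_ids]


def mean_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # masked position: contributes nothing to the loss
        log_z = math.log(sum(math.exp(v) for v in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0


# Toy vocabulary of 4 tokens where id 3 plays the role of EOS.
labels = mask_eos_labels([1, 2, 3, 3], eos_id=3)

shared_rows = [[0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
logits_a = shared_rows + [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
logits_b = shared_rows + [[0.0, 0.0, 0.0, 9.0], [1.0, 1.0, 1.0, 1.0]]

loss_a = mean_cross_entropy(logits_a, labels)
loss_b = mean_cross_entropy(logits_b, labels)
```

`logits_a` and `logits_b` differ only at the two EOS positions, so the two losses come out identical: whatever the model predicts there has no influence on the gradient.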
Transformers only has a 1D attention_mask so I can't do proper sample packing without using this:
https://github.com/MeetKai/functionary/tree/main/functionary/train/packing
BUT: I'm not convinced this is actually beneficial, as during pre-training the LLMs were trained on data that looks just like what I am giving them, eg:
<BOS> sample text 1 <EOS> sample text 2 <EOS>...
and the interference might actually be beneficial and force the fine-tune to concentrate better on each example with the surrounding "noise".
So now we have a dataset format that is sequence length agnostic (eg: large clever models won't get hugely lower/different losses) and no longer biases the response length (due to masking the <EOS> labels for gradient calculations) to be shorter or longer.
We also have much higher entropy training data due to randomised paragraphs to be looked at in isolation (eg: things like names are only high-entropy when you first encounter them; after seeing the name(s) at the start of a story they become low-entropy for the remainder of the sequence...).
BUT: The most exciting possibility is to add some contextual text before each paragraph (or group of paragraphs if it turns out to be needed), such as: the author's name, book title, genre and so on, which can then be masked in the same way as the <EOS> tokens (in a similar way to instruction tuning "prompt-masking" method). So the model should then be able to learn the association between the contextual meta-data and the style of writing!!!
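A minimal sketch of what such a metadata-prefixed training example could look like, assuming the metadata and paragraph have already been tokenised. The function name `build_example` and the token IDs are hypothetical; the only point is which positions receive a real label and which get -100.

```python
IGNORE_INDEX = -100


def build_example(metadata_ids, paragraph_ids, eos_id):
    """Concatenate metadata + paragraph + EOS; only paragraph tokens get labels."""
    input_ids = list(metadata_ids) + list(paragraph_ids) + [eos_id]
    labels = (
        [IGNORE_INDEX] * len(metadata_ids)  # metadata: attended to, never predicted
        + list(paragraph_ids)               # paragraph: trained on as normal
        + [IGNORE_INDEX]                    # trailing EOS: masked like before
    )
    return {"input_ids": input_ids, "labels": labels}


# Hypothetical token IDs: 11-13 stand for the metadata, 21-24 for the paragraph.
example = build_example(metadata_ids=[11, 12, 13], paragraph_ids=[21, 22, 23, 24], eos_id=2)
```

The metadata tokens still sit in the context, so the paragraph tokens are predicted conditioned on them, but the model is never asked to generate the metadata itself.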
For the time being I am just going back to using stock cross-entropy loss (ie: no attempt to increase the entropy of the outputs), and just using the 1.1M randomised paragraphs as outlined above to hopefully get something much closer to the "multiplicative control-vectors" that I set out to create right at the start, but the possibilities this new dataset method opens up are huge IMO.
Another benefit of this is that it trains in about 1/2 the time as before, partly due to removing the 40% of the "dross" from the old books files converted to text, but also because I can now increase the batch size right up to the GPU memory limit and not worry that large/smart models with long context can just memorise everything easily; all models should now face the same prediction task, with a similar starting loss regardless of the batch size or their native context length.
I look forward to seeing the result!
So to make sure I understand, you're essentially doing the equivalent of this "train on completions" prompt-masking like unsloth supports, but since there's no instruction prompt, you're only masking the <EOS>:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])
Extract all paragraphs that are between 200 and 2000 characters
I like this idea, that's actually a really simple way to get rid of the junk.
Randomly shuffle all the 1.1M paragraph + "\n\n" + <EOS> chunks and concatenate them to use as training data.
So this would also teach the model to end every turn with \n\n
things like names are only high-entropy when you first encounter them; after seeing the name(s) at the start of a story they become low-entropy for the remainder of the sequence...
I've read your post a few times, but I'm not understanding why/how this part would work?
So to make sure I understand, you're essentially doing the equivalent of this "train on completions" prompt-masking like unsloth supports, but since there's no instruction prompt, you're only masking the <EOS>:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])
Yeah, setting the label to -100 like this means it gets ignored for the loss calculations (-100 is the default ignore_index of PyTorch's cross-entropy loss), but it still gets used for the attention mechanism (the attention_mask can be used for padding tokens to both ignore them for the gradient calculation and make the tokens effectively "invisible" to the attention mechanism, but that's not what we want here).
Extract all paragraphs that are between 200 and 2000 characters
I like this idea, that's actually a really simple way to get rid of the junk.
Yeah, I found you can go smaller but the 50-100 character paragraphs in isolation give so little leading context that they aren't likely to be very useful, and by choosing ~200 characters you 100% remove all the useless junk like page numbers, tables of contents, etc.
The reason for setting an upper limit is that things like markdown quotations using > characters can create long run-on "paragraphs" that are really several paragraphs joined.
Randomly shuffle all the 1.1M paragraph + "\n\n" + <EOS> chunks and concatenate them to use as training data.
So this would also teach the model to end every turn with \n\n
I'm hoping it will just learn to end every paragraph with \n\n as it's not actually getting any loss calculated for the following <EOS> token and it should just appear similar to training on larger texts that the model just happens to only see the first paragraph of.
things like names are only high-entropy when you first encounter them; after seeing the name(s) at the start of a story they become low-entropy for the remainder of the sequence...
I've read your post a few times, but I'm not understanding why/how this part would work?
Imagine I give you several chapters of a book to read. If you learn the protagonist is called "Tom" in chapter 1, then at the point where you first learn his name there could be a huge range of possible names (very high entropy), but as soon as you know his name is "Tom" the range of valid names drops to just a single possibility (very low entropy).
If these several chapters can fit in a context of 16k or 32k tokens then each time you are about to generate the name "Tom" you aren't really going to get any gradient information from it as the model will be near 100% correct.
On the other hand if you mix these same chapters up with 1000 other books' chapters, and then force the model to look at just a single paragraph (or possibly handful of paragraphs) then the model will be left guessing much more and have to use the very sparse preceding context to guess the valid range of names based on whatever clues it can glean from it (ie: locale, sex, other nouns, etc).
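The "Tom" example can be put into numbers with a quick Shannon-entropy calculation. The figures below (1000 equally likely names at first mention, 99% confidence later on) are purely illustrative assumptions, not measurements from any model.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# First mention: assume 1000 equally plausible names.
first_mention = [1 / 1000] * 1000
# Later mentions: assume the model is 99% sure it is "Tom", the rest spread thinly.
later_mention = [0.99] + [0.01 / 999] * 999

h_first = entropy_bits(first_mention)   # close to log2(1000), i.e. about 10 bits
h_later = entropy_bits(later_mention)   # a small fraction of a bit
```

With long sequences nearly every occurrence of the name falls into the second case, which is why shuffled, isolated paragraphs give far more "first mention" style predictions per token.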
This is quite an interesting article on prompt masking / prompt weighting:
(just open in an incognito tab if it won't show - it's pretty rare I ever find anything useful on Medium, but this is one rare case)
If this works then I'm actually most excited about adding masked metadata before each paragraph, as IMO that has the ability to really start to be useful and goes right back to the original idea of using Multiplicative-LoRAs as a kind of "conditional control vector" that can add a (signed) direction (from lora_B) only if it detects another direction (from the corresponding vector in lora_A).
So the hope would be one or more vectors in lora_A would learn to pick out hidden states that are relevant to a given bit of metadata (like author name, genre, era, and so on) and then bias the output of the LLM using the corresponding vector in lora_B to make it closer to what we want...
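One way to read the "conditional control vector" description is as a rank-1 update of the form hidden + b * (a . hidden), where `a` plays the role of one vector from lora_A and `b` the matching vector from lora_B. The sketch below assumes exactly that form, with made-up 3-dimensional vectors; the real Multiplicative-LoRA implementation may well differ in its details.

```python
def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def conditional_control_vector(hidden, a, b):
    """Rank-1 update: hidden + b * (a . hidden)."""
    gate = dot(a, hidden)  # how strongly `hidden` points along the detector `a`
    return [h + gate * bi for h, bi in zip(hidden, b)]


a = [1, 0, 0]  # hypothetical "detector" direction (a row of lora_A)
b = [0, 0, 2]  # hypothetical direction to add (a column of lora_B)

unaffected = conditional_control_vector([0, 5, 0], a, b)   # nothing along `a`
pushed = conditional_control_vector([3, 5, 0], a, b)       # positive match
reversed_ = conditional_control_vector([-3, 5, 0], a, b)   # negative match
```

A hidden state with no component along `a` passes through untouched, while a positive or negative component adds `b` with the corresponding sign and scale, which is the "signed direction, only if detected" behaviour described above.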
It could even be a way to recover some of the ability of newer/better LLMs that have been trained on filtered or synthetic data - most "smart" models still have the ability to write like anyone if they can continue on from a real example.
I should know by tomorrow if it has any potential, as I'm currently training on top of command-r:32b (new version) which is more prone to sloppy writing...
I just need to be careful of overfitting though, as 40% of my data has been pruned away and I now only have ~100M tokens, and even a rank-16 LoRA on command-r:32b is ~10M trainable parameters... I don't want to reject this method thinking it's broken, but later find it was because of overfitting! So back to using a more conservative rank, lora_dropout and weight_decay to hopefully mitigate the chance of this.
It is definitely learning something:
but will likely be very conservative changes to the output if it isn't broken.
I've just noticed some interesting stuff about the Cohere tokeniser:
https://huggingface.co/CohereForAI/c4ai-command-r-v01/blob/main/tokenizer_config.json
{
"add_bos_token": true,
"add_eos_token": false,
"add_prefix_space": false,
"added_tokens_decoder": {
"0": {
"content": "<PAD>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<UNK>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "<CLS>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"3": {
"content": "<SEP>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"4": {
"content": "<MASK_TOKEN>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"5": {
"content": "<BOS_TOKEN>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"6": {
"content": "<EOS_TOKEN>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"7": {
"content": "<EOP_TOKEN>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"255000": {
"content": "<|START_OF_TURN_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255001": {
"content": "<|END_OF_TURN_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"255002": {
"content": "<|YES_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255003": {
"content": "<|NO_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255004": {
"content": "<|GOOD_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255005": {
"content": "<|BAD_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255006": {
"content": "<|USER_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255007": {
"content": "<|CHATBOT_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255008": {
"content": "<|SYSTEM_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"bos_token": "<BOS_TOKEN>",
"eos_token": "<|END_OF_TURN_TOKEN|>",
They used an actual <EOS_TOKEN> (and <EOP_TOKEN>) token during pre-training, but then it got switched to "eos_token": "<|END_OF_TURN_TOKEN|>" during fine-tuning.
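A quick way to see the mismatch is to map token contents to IDs and look up which one the config names as its eos_token. The snippet below parses a trimmed, hand-copied subset of the config quoted above (only four of the added tokens) rather than the real file; the variable names are just for illustration.

```python
import json

# A trimmed copy of the entries quoted above (only four of the added tokens).
config_text = """
{
  "added_tokens_decoder": {
    "5": {"content": "<BOS_TOKEN>", "special": true},
    "6": {"content": "<EOS_TOKEN>", "special": true},
    "7": {"content": "<EOP_TOKEN>", "special": true},
    "255001": {"content": "<|END_OF_TURN_TOKEN|>", "special": true}
  },
  "bos_token": "<BOS_TOKEN>",
  "eos_token": "<|END_OF_TURN_TOKEN|>"
}
"""

config = json.loads(config_text)
id_by_content = {
    entry["content"]: int(token_id)
    for token_id, entry in config["added_tokens_decoder"].items()
}

configured_eos_id = id_by_content[config["eos_token"]]  # what the tokeniser reports as EOS
literal_eos_id = id_by_content["<EOS_TOKEN>"]           # the low-numbered <EOS_TOKEN> entry
```

So anything that relies on `tokenizer.eos_token_id` will be using ID 255001 rather than the low-numbered <EOS_TOKEN> at ID 6.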
Also the use of <CLS>, <SEP> and <MASK_TOKEN> during pre-training likely means it was trained (at least partly) using non-causal data (ie: like BERT where it gets to see the future tokens and has to fill in the masked/middle tokens):
https://huggingface.co/docs/transformers/en/main_classes/tokenizer
https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/
It looks like llama3 might have done something similar with its tokeniser:
<|end_of_text|>
Model will cease to generate more tokens. This token is generated only by the base models.
<|eom_id|>
End of message. A message represents a possible stopping point for execution where the model can inform the executor that a tool call needs to be made. This is used for multi-step interactions between the model and any available tools. This token is emitted by the model when the Environment: ipython instruction is used in the system prompt, or if the model calls for a built-in tool.
<|eot_id|>
End of turn. Represents when the model has determined that it has finished interacting with the user message that initiated its response. This is used in two scenarios:
- at the end of a direct interaction between the model and the user
- at the end of multiple interactions between the model and any available tools
This token signals to the executor that the model has finished generating a response.
This makes me wonder if we can still use these tokens for fine-tuning if we set the labels to -100?
I'm gonna test using each of these:
- <SEP>
- <EOP_TOKEN>
- \n + <EOS_TOKEN>
- \n + <|END_OF_TURN_TOKEN|>
- \n + \n + <EOS_TOKEN>
- \n + \n + <|END_OF_TURN_TOKEN|>
to delimit the paragraphs (with the label set to -100), and see what it does to the losses for command-r:32b (I'm currently running \n + \n + <|END_OF_TURN_TOKEN|>).
I don't think using <EOS_TOKEN> or <|END_OF_TURN_TOKEN|> without any new lines prepended makes much sense, but from reading the paper (which I re-linked below after my post above vanished) <EOP_TOKEN> and <SEP> are worth trying.
One of my posts just vanished above, but in it I linked these two:
https://arxiv.org/abs/2004.02251
and said that the order of the Cohere models' token ID numbers makes it look like they might have first pre-trained bi-directionally, then pre-trained causally, then finally fine-tuned.
If this works then I'm actually most excited about adding masked metadata before each paragraph, as IMO that has the ability to really start to be useful and goes right back to the original idea of using Multiplicative-LoRAs as a kind of "conditional control vector" that can add a (signed) direction (from lora_B) only if it detects another direction (from the corresponding vector in lora_A).
So the hope would be one or more vectors in lora_A would learn to pick out hidden states that are relevant to a given bit of metadata (like author name, genre, era, and so on) and then bias the output of the LLM using the corresponding vector in lora_B to make it closer to what we want...
It could even be a way to recover some of the ability of newer/better LLMs that have been trained on filtered or synthetic data - most "smart" models still have the ability to write like anyone if they can continue on from a real example.
Glad to see you're giving this model a go for us 24gb and below users :-)
Well, if I can get this working properly then I think it should work with smaller models too:
I think the reason @tdrussell 's "Instruct-Storywriter" method didn't work well on small models is because they got a huge drop in loss compared to larger models, whereas this method of using a bunch of randomised paragraphs gets a similar loss for all models, and big models can't rely so much on already having the stories encoded in their weights.
After reading the paper I linked above about the use of <SEP> and <EOP_TOKEN>:
The most important observation is that, without EOP, the beginning of the generation is more relevant to the end of the input prompt, but the more it generates, the poor quality is. While the generator with EOP can generate multiple paragraphs related to the input with a reasonable ending but each paragraph is more independent than human writings.
(see Appendix B too)
Added to the fact that my paragraphs are all seen in isolation and randomised, I think actually the only ones I need to try now are:
- <EOS_TOKEN>
- \n + <EOS_TOKEN>
- \n + \n + <EOS_TOKEN>
and:
- <|END_OF_TURN_TOKEN|>
- \n + <|END_OF_TURN_TOKEN|>
- \n + \n + <|END_OF_TURN_TOKEN|>
It only takes around 20 hours per run so I can easily test all of these, but it will be harder to compare the evaluation losses between the different new line variants as the models can probably "cheat" and learn the pattern from earlier examples...
and this bit from the paper:
This observation indicates that GPT2 tends not to generate the EOS following the NL even after fine-tuning, but it can learn better EOS with the help of a new EOP token.
makes me think that adding the new lines right before the <EOS> token might be a bad idea (but not 100% sure if I'm setting the <EOS> label to -100).
So next I will try <|END_OF_TURN_TOKEN|> and <EOS_TOKEN> (with label set to -100) as these should be easier to compare.
Mate, that's awesome! Can't wait to see it.
All this is getting way too complicated and it's unclear exactly what effect all these different ways of breaking paragraphs are going to have on an instruction-tuned model...
So... I'm just gonna generate my data as before:
Paragraph 1
<EOS>Paragraph 2
<EOS>Paragraph 3
.
.
.
<EOS>Paragraph N-1
<EOS>Paragraph N
<EOS>
Then tokenise this with the <EOS> tokens ensuring each paragraph with the 2 trailing newlines gets tokenised as a whole.
Then use this to just output huge sequences of random paragraphs to train on:
<BOS>Paragraph 1
Paragraph 2
Paragraph 3
.
.
.
Paragraph N-1
Paragraph N
<EOS>
<EOS>
<EOS>
and completely mask out the <EOS> tokens in the same way as <PAD> would be.
It will likely confuse the model somewhat, but may actually be less confusing than attempting to use all these breaking tokens for an instruction-tuned model and the distribution of newlines in real stories should be retained.
(If it does cause the model to not be able to output any special tokens, then I can deal with that by using a second dataset that is passed through the chat template but then mask out everything except the special tokens. Even if the second dataset is full of horrible slop-ridden stories; it will still be able to hopefully fix the frequencies of special tokens if needed....)
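A rough sketch of this simplified packing, under the assumption that "masked like <PAD>" means both a zero attention_mask and a -100 label for the trailing <EOS> tokens. The function name `pack_paragraphs` and the token IDs are hypothetical, and each paragraph is assumed to be pre-tokenised with its two trailing newlines but without any <EOS>.

```python
IGNORE_INDEX = -100


def pack_paragraphs(paragraph_ids, bos_id, eos_id, sequence_len):
    """Pack whole paragraphs into one sequence and right-pad with masked EOS."""
    input_ids = [bos_id]
    for ids in paragraph_ids:
        if len(input_ids) + len(ids) > sequence_len:
            break  # never split a paragraph; the rest would go into the next sequence
        input_ids.extend(ids)
    n_real = len(input_ids)
    n_pad = sequence_len - n_real
    return {
        "input_ids": input_ids + [eos_id] * n_pad,
        # Padding EOS tokens are hidden from attention...
        "attention_mask": [1] * n_real + [0] * n_pad,
        # ...and excluded from the loss.
        "labels": input_ids + [IGNORE_INDEX] * n_pad,
    }


# Three toy "paragraphs" as token-ID lists; the third will not fit in 8 tokens.
packed = pack_paragraphs([[10, 11, 12], [20, 21], [30, 31, 32, 33]],
                         bos_id=5, eos_id=6, sequence_len=8)
```

Because no <EOS> sits between paragraphs, the only boundary signal the model sees is the pair of newlines, which matches how paragraph breaks look in ordinary story text.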
It's a bit of a dodgy hack, but I've found a way to avoid screwing up the frequencies of the special tokens:
"added_tokens_decoder": {
"0": {
"content": "<PAD>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<UNK>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "<CLS>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"3": {
"content": "<SEP>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"4": {
"content": "<MASK_TOKEN>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"5": {
"content": "<BOS_TOKEN>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"6": {
"content": "<EOS_TOKEN>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"7": {
"content": "<EOP_TOKEN>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"255000": {
"content": "<|START_OF_TURN_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255001": {
"content": "<|END_OF_TURN_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"255002": {
"content": "<|YES_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255003": {
"content": "<|NO_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255004": {
"content": "<|GOOD_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255005": {
"content": "<|BAD_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255006": {
"content": "<|USER_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255007": {
"content": "<|CHATBOT_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255008": {
"content": "<|SYSTEM_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255009": {
"content": "<|USER_0_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255010": {
"content": "<|USER_1_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255011": {
"content": "<|USER_2_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255012": {
"content": "<|USER_3_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255013": {
"content": "<|USER_4_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255014": {
"content": "<|USER_5_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255015": {
"content": "<|USER_6_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255016": {
"content": "<|USER_7_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255017": {
"content": "<|USER_8_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255018": {
"content": "<|USER_9_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255019": {
"content": "<|EXTRA_0_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255020": {
"content": "<|EXTRA_1_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255021": {
"content": "<|EXTRA_2_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255022": {
"content": "<|EXTRA_3_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255023": {
"content": "<|EXTRA_4_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255024": {
"content": "<|EXTRA_5_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255025": {
"content": "<|EXTRA_6_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255026": {
"content": "<|EXTRA_7_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255027": {
"content": "<|EXTRA_8_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"255028": {
"content": "<|NEW_FILE|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"255029": {
"content": "<|BEGINNING_OF_PREFIX_FIM_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"255030": {
"content": "<|BEGINNING_OF_MIDDLE_FIM_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"255031": {
"content": "<|BEGINNING_OF_SUFFIX_FIM_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"255032": {
"content": "<|END_OF_MIDDLE_FIM_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"255033": {
"content": "<|EXTRA_9_TOKEN|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
}
},
by hacking the Triton kernel:
@triton.heuristics({
    "DO_LOGIT_SCALING": lambda args: args["DO_LOGIT_SCALING"],
})
@triton.jit
def _cross_entropy_backward(
    logits_ptr, logits_row_stride,
    dloss_ptr, dloss_row_stride,
    logsumexp_ptr,
    labels_ptr,
    VOCAB_SIZE : tl.constexpr,
    BLOCK_SIZE : tl.constexpr,
    DO_LOGIT_SCALING : tl.constexpr,
    LOGIT_SCALE : tl.constexpr,
):
    """
    CE_i = -y log(P) = y * (log[sum(exp(x))] - x)
    dC/dx = d/dx (y * log[sum(exp(x))] - x * y)
    From https://en.wikipedia.org/wiki/LogSumExp
    d/dx logsumexp = exp(x) / sum(exp(x)) = softmax(x)
    dC/dx = y * exp(x) / sum(exp(x)) - d/dx (x * y)
    dC/dx = y * exp[ log[exp(x) / sum(exp(x))] ] using x = exp(log(x)) trick
    dC/dx = y * exp[x - logsumexp] - d/dx (x * y)
    If y == 0: dC/dx = 0
    If y == 1 and x == label: dC/dlabel = exp[x - logsumexp] - 1
    If y == 1 and x != label: dC/dx = exp[x - logsumexp]
    """
    row_idx = tl.program_id(0)
    block_idx = tl.program_id(1)
    logits_ptr += row_idx * logits_row_stride.to(tl.int64)
    dloss_ptr += row_idx * dloss_row_stride
    col_offsets = block_idx*BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < VOCAB_SIZE
    label_idx = tl.load(labels_ptr + row_idx).to(tl.int32)
    if label_idx != -100:
        dloss = tl.load(dloss_ptr)
    else:
        dloss = 0.0
    x = tl.load(logits_ptr + col_offsets, mask = mask, other = -float("inf")).to(tl.float32)
    if DO_LOGIT_SCALING:
        # d/dx [s * x] = s
        x = LOGIT_SCALE * x
    pass
    logsumexp = tl.load(logsumexp_ptr + row_idx)
    y = tl.exp(x - logsumexp)
    y = tl.where(
        col_offsets == label_idx,
        y - 1.0, # exp(x - logsumexp) - 1
        y,       # exp(x - logsumexp)
    )
    #######################################################
    # Zero out the gradients for the Cohere special tokens.
    y = tl.where(
        (col_offsets <= 7) | (col_offsets >= 255000),
        0.0,
        y,
    )
    #######################################################
    # If y == 0: dC/dx = 0 ==> we already masked it to be = 0, so dloss = 0.
    if DO_LOGIT_SCALING:
        # d/dx [s * x] = s
        y = LOGIT_SCALE * y
    pass
    tl.store(logits_ptr + col_offsets, dloss * y, mask = mask)
pass
so that no gradient is back-propagated for these tokens.
This should fix the problems regarding the frequencies of these going slowly to zero due to having none of them in your training data!
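For anyone who finds the Triton version hard to follow, here is a minimal pure-Python sketch of the same per-row gradient, assuming no logit scaling and an upstream `dloss` of 1. The function name `masked_ce_grad` and the toy "special" rule are made up for illustration and are not part of the kernel or of qlora-pipe:

```python
import math

def masked_ce_grad(logits, label, is_special, ignore_index=-100):
    """Gradient of cross-entropy w.r.t. one row of logits, with frozen columns.

    Same logic as the kernel: softmax(x) - one_hot(label), then zero the
    entries of every token id flagged by `is_special`.
    """
    if label == ignore_index:
        return [0.0] * len(logits)
    m = max(logits)
    logsumexp = m + math.log(sum(math.exp(x - m) for x in logits))
    grad = []
    for idx, x in enumerate(logits):
        g = math.exp(x - logsumexp)   # softmax probability
        if idx == label:
            g -= 1.0                  # subtract the one-hot target
        if is_special(idx):
            g = 0.0                   # no gradient for special tokens
        grad.append(g)
    return grad

# Toy vocabulary of 12 ids; this rule is a hypothetical stand-in for
# `(col_offsets <= 7) | (col_offsets >= 255000)`.
is_special = lambda idx: idx <= 1 or idx >= 10
row = [0.5, -1.0, 2.0, 0.1, 0.0, 1.5, -0.3, 0.7, 0.2, -2.0, 3.0, 0.4]
grad = masked_ce_grad(row, label=5, is_special=is_special)
ignored = masked_ce_grad(row, label=-100, is_special=is_special)
```

Note that the unmasked gradient sums to zero across the vocabulary, while the masked one generally does not, since the probability mass sitting on the frozen tokens is simply left alone.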
Now I just need to see what happens when we train on these massive files of:
<BOS>paragaph1
paragraph2
paragraph3
I've found out using:
https://huggingface.co/spaces/Xenova/the-tokenizer-playground
that the above tokenises to:
[5, 35, 138854, 37, 2385, 1786, 16599, 24, 206, 206, 95337, 25, 206, 206, 95337, 26]
with 206 being the newlines.
I'm hoping by keeping these newlines we DO actually bias the frequency of these to be closer to actual authors' writing style, but if this fails I also can zero their gradient if needs be.
Fingers crossed this works!
Sorry for the lack of updates, but I have still been progressing slowly with this:
- I'm still getting the weird "extra step" at the end of every training run, but unless I use a cosine-annealed schedule it doesn't seem to make any difference.
- I've found a much better way to initialise the LoRAs which lets me run projected gradient descent on lora_A so it stays on the surface of a unit sphere, and then use weight-decay only on lora_B.
I'll post more details and hopefully the v1.0 of command-r:32b before the new year.
I haven't tested it yet, but the new initialization / optimisation may let me bump the Entropy up even further than I could before, but for now I'm just using stock Cross-Entropy loss and no attempt to increase Entropy until I get the hyper-parameters dialed in properly...
I'm still running on the 1.1M random paragraphs dataset and using the "hack" I posted above to avoid the special tokens getting nerfed:
https://github.com/tdrussell/qlora-pipe/discussions/41
I'll be buggered if I can make this work in pytorch without using 10GB extra VRAM (for no apparent reason - even using "chunking"???), but the Triton kernel modification works...
If anybody has any suggestions I'd be very grateful, as currently this dodgy hack will mean the code needs to be edited for every different model :/
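One possible way to avoid hard-coding the id ranges per model is to read them from the tokenizer config instead. The sketch below assumes the usual Hugging Face `tokenizer_config.json` layout with an `added_tokens_decoder` section like the one pasted above; `frozen_token_ids` is a hypothetical helper, and turning the resulting ids into something the kernel can consume (eg: a mask tensor) is left out:

```python
import json

def frozen_token_ids(tokenizer_config, only_special=False):
    """Return the sorted token ids listed under `added_tokens_decoder`.

    With only_special=True, keep just the entries flagged "special": true.
    """
    added = tokenizer_config.get("added_tokens_decoder", {})
    return sorted(
        int(token_id)
        for token_id, info in added.items()
        if info.get("special", False) or not only_special
    )

# Three entries copied from the config above, trimmed to the fields used here.
config = json.loads("""
{
  "added_tokens_decoder": {
    "255001": {"content": "<|END_OF_TURN_TOKEN|>", "special": true},
    "255002": {"content": "<|YES_TOKEN|>", "special": false},
    "255028": {"content": "<|NEW_FILE|>", "special": true}
  }
}
""")
all_ids = frozen_token_ids(config)
special_ids = frozen_token_ids(config, only_special=True)
```

For the Cohere tokenizer most of the added tokens are flagged as not special, so the default of returning every added token is probably closer to what the hack above does.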
Merry Christmas!
Merry Christmas!
https://huggingface.co/spaces/Xenova/the-tokenizer-playground
This looks useful. I've got a tokenizer issue to investigate myself. I've been using the standard eg:
from transformers import AutoTokenizer
writer_tokenizer = AutoTokenizer.from_pretrained("gghfez/Writer-Large-2411-v2.1")
print(writer_tokenizer.encode("""<BOS>paragaph1
paragraph2
paragraph3"""))
So it looks like for command-r, 206 is 1 linefeed and 2126 is 2 linefeeds.
If anybody has any suggestions I'd be very grateful, as currently this dodgy hack will mean the code needs to be edited for every different model :/
Sorry, what you're doing is beyond my level right now.
Merry Christmas!
Not related to creative writing, but the new QVQ:72B model is insanely impressive:
- I gave it an obscure picture of a train line map I took at a museum a few months ago: horrible photo, glare reflecting off the perspex in front of it, etc. Then asked it to estimate the date and it absolutely nailed it by looking at the place names, the dates the lines were created and cut, the style of the fonts, and so on!
- I gave it a picture of my brother and his wife sitting in front of a waterfall in New Zealand and it looked at the foliage, lighting, water colour and so on to narrow it down and actually got the exact place!
- I gave it a picture of my confusing 3-phase electric meter and asked for the reading, and it managed to ignore all the distractions and read the exact value!
I think GeoGuessr will have to start working on their anti-cheat as it's likely better than 99% of the population!!!
Merry Christmas all! Have a great day!
Just starting to upload the v1.0 creative writer models, but noticed you can only have 100GB private storage now... Due to having such poor upload bandwidth I usually make them private until they are finished, but not sure what will happen now?
I don't 'think' those are enforced limits yet? I guess we will find out.
Can confirm, it's not enforced yet (thank God). Earlier today I pushed (private):
- A Llama-3.3-Instruct-70b finetune @ fp16
- A Llama-3.2-90b-vision with that ^ 70b merged into it @ fp16
- Several LoRA checkpoints, a couple of tokenizers and some mistral-large hidden_state files.
Oh and I gguf-my-repo'd a Qwen2.5-32b finetune privately.
All worked fine.
Tried having Flux give me pictures about disk storage space police... it did not understand the assignment... fixated on 'space police' :D
Tried having Flux give me pictures about disk storage space police... it did not understand the assignment... fixated on 'space police' :D
Cinematic shot from retro 80s cop movie. The room is full of hard drives. Like a lot of hard drives. The hard drives are scattered everywhere. Piles of hard drives can be seen in the background. Two police officers with pistols are busting through the door. Both of the police officers are wearing uniforms with text "HuggingFace" and a big hugging face emoji.
I might try to upload all the cohere-based models before opening the repos then, as the 32b is likely to be not that great compared to the 35b and the 104b based off the old command-r-plus model (I may even try to create an 8b using aya-expanse-8b [or even aya-23-8B]). So:
- creative-writer-v1.0-32b
- creative-writer-v1.0-35b
- creative-writer-v1.0-104b
- (and possibly) creative-writer-v1.0-8b
I've also found that increasing the Entropy is best done via a second epoch using the same training data (or otherwise the momentum-based optimisers like Adam massively overshoot and/or try to make the norm of the hidden state smaller to "cheat"). I'm going to call these models "creative-writer-plus" and use around the same value as I used for the bravo experimental models, as this seems to give a good balance between increasing the Entropy vs making the model not follow instructions quite as well. So:
- creative-writer-plus-v1.0-32b
- creative-writer-plus-v1.0-35b
- creative-writer-plus-v1.0-104b
- (and possibly) creative-writer-plus-v1.0-8b
I'm actually really happy with the hyper-parameters and extra code used to train these now, and it will likely take me 1-2 days to write the README.MD file.
So on top of all I wrote about the use of "Multiplicative-LoRAs" (which are explained in the README.MD of the experimental models), here is a rough draft of what I am now doing:
- lora_A is initialised using torch.nn.init.orthogonal_ so that all 16 rows form a semi-orthogonal matrix (ie: "signed direction detection" vectors of unit length).
- lora_B is initialised to be all zeros as usual.
After each step of the optimiser, I then use this custom code:
import torch

def apply_lora_norm_regularization(model, config, current_lr):
    assert config['lora_alpha'] == config['lora_rank'], "Used `nn.init.orthogonal_` so must have: alpha = r"
    weight_decay = config['optimizer'].get('lora_weight_decay', 0.0)
    # Scale the decay by the current LR relative to the base LR
    lora_B_scaler = 1.0 - (current_lr / config['optimizer']['lr']) * weight_decay if weight_decay > 0 else 1.0
    for name, param in model.named_parameters():
        if 'lora_A' in name:
            # Project each row of lora_A back onto the surface of the unit ball
            with torch.no_grad():
                param.div_(param.norm(p=2, dim=1, keepdim=True))  # TODO: Check this works for bfloat16
        elif 'lora_B' in name and lora_B_scaler < 1.0:
            # Shrink each column of lora_B back towards the origin
            with torch.no_grad():
                param.mul_(lora_B_scaler)  # TODO: Check this works for bfloat16
To:
- Perform projected gradient descent (sorry no Wiki page for it?) on lora_A to enforce the unit length (but not the semi-orthogonality; although this is possible and might be worth trying in the future - see section 2.1 of Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks). It might also be worth projecting the gradients onto the tangent space as detailed in the top reply to this post, but I can't easily try this without altering lots of scary Deepspeed stuff in qlora-pipe (plus I'm not 100% convinced it is actually the correct thing to do anyway...).
- Perform fully decoupled weight decay on lora_B ONLY.
This essentially makes each pair of vectors lora_A[i] x lora_B[i] in the outer-product act like the "conditional control vectors" I envisioned when I set out to do this about 6 months ago...
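To see the two post-step operations in isolation, here is a toy sketch on bare tensors rather than a real model. The matrix sizes, the fake "optimiser step" and the learning-rate / weight-decay values are all made up for illustration:

```python
import torch

torch.manual_seed(0)
rank, in_features, out_features = 16, 64, 64  # toy sizes, not the real model's

lora_A = torch.empty(rank, in_features)
torch.nn.init.orthogonal_(lora_A)             # rows are orthonormal when rank <= in_features
lora_B = torch.zeros(out_features, rank)

# Pretend an optimiser step moved both matrices (random noise as a stand-in).
lora_A += 0.01 * torch.randn_like(lora_A)
lora_B += 0.01 * torch.randn_like(lora_B)

# Projection step: renormalise each row of lora_A back to unit length.
lora_A.div_(lora_A.norm(p=2, dim=1, keepdim=True))

# Decoupled weight decay: shrink lora_B directly, independent of any gradient.
current_lr, base_lr, weight_decay = 5e-5, 1e-4, 0.1  # hypothetical values
norm_before = lora_B.norm().item()
lora_B.mul_(1.0 - (current_lr / base_lr) * weight_decay)
norm_after = lora_B.norm().item()

row_norms = lora_A.norm(p=2, dim=1)
```

After this, every row of lora_A has unit length again (though the rows are no longer exactly orthogonal), and lora_B has shrunk by the factor 0.95 implied by the values above.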
On top of this, I now have the following changes to the dataset creation process:
- I start with ~185M tokens extracted from ~1000 books.
- I then extract all the paragraphs, which are then heuristically filtered to leave approximately 1.1M definite/clean paragraphs (~135M tokens in total).
- The 1.1M paragraphs are then randomly concatenated into blocks of 8192 tokens, separated by double newlines, and right-padded with a small number of EOS tokens where needed, eg:
<BOS>paragraph1
paragraph2
paragraph3
.
.
.
paragraph n-1
paragraph n
<EOS><EOS><EOS><EOS><EOS>
The EOS tokens used for padding all have their labels set to -100 and their attention_mask flags set to 0 (to avoid training on them).
Also, to avoid the slow decay of the special token probabilities; all special tokens have their gradient set to zero in the backward function of the Triton kernel (which is approximately equivalent to assuming that the output probability value of the special tokens is always exactly what the target for these is, and that we correct the 1-hot targets to account for these so the target vector sums to unity).
Finally we use the "Focal Loss Star" loss function with gamma = 1.01 to maintain the entropy during training.
The "plus" variants of the models will just use the same config file and dataset but use the "Focal Loss Star" loss function with gamma = 1.1 to adjust the merged LoRA model from the previous stage.
The only way forward now would be to start to scale up the dataset size, as even just using a rank-16 LoRA for the down_proj matrices only, the number of tokens per tunable parameter is pretty low (eg: around 15 tokens/parameter for the 32b and 35b models, and less than 5 tokens/parameter for the bigger models). There is probably a (very) significant amount of noise getting added to the 8192+ dimension vectors compared to the control-vector training, where I used several orders of magnitude more tokens per tunable parameter.
I think this could be accomplished by leaving the config file completely unchanged (ie: weight_decay, learning_rate, etc) and just increasing the batch size as we get more samples in the future. Increasing the rank probably doesn't make much sense either, and I'm only using rank-16 as this seems to be the smallest value I could get working without too much risk of getting stuck in saddle points... The possible combinations of all the rank-16 vectors in all the n-1 layers is really high anyway; it's the low tokens per tunable parameter that is likely going to be the biggest problem.
I have a shit-ton of books that I can create the datasets from, but even with 6x A6000s it's not really feasible to scale up much more and it would need a rented GPU cluster to train for the larger 70B+ parameter models.
One final thing: don't get your hopes up too much for models like qwen and llama-3 - this process can (subjectively) make an already good creative-writing model (slightly) better and/or increase its Entropy back towards real natural language, but it can't make a horrible creative-writing model into a good creative-writing model... :)
It's also unlikely to massively improve the dreaded "slop attractor phrases" as it's only considering single-token Entropy and never rolling out to the generation-space where some of these live (it will probably help somewhat with this though).
Tried having Flux give me pictures about disk storage space police... it did not understand the assignment... fixated on 'space police' :D
Cinematic shot from retro 80s cop movie. The room is full of hard drives. Like a lot of hard drives. The hard drives are scattered everywhere. Piles of hard drives can be seen in the background. Two police officers with pistols are busting through the door. Both of the police officers are wearing uniforms with text "HuggingFace" and a big hugging face emoji.
Love it!
One final thing: don't get your hopes up too much for models like qwen and llama-3 - this process can (subjectively) make an already good creative-writing model (slightly) better and/or increase its Entropy back towards real natural language, but it can't make a horrible creative-writing model into a good creative-writing model... :)
It's also unlikely to massively improve the dreaded "slop attractor phrases" as it's only considering single-token Entropy and never rolling out to the generation-space where some of these live (it will probably help somewhat with this though).
Making comparatively good models better works for me. Need to put some of these full precision models on ice just in case. Though, I guess there isn't just one copy of anything anymore. :D
Cinematic shot from retro 80s cop movie. The room is full of hard drives. Like a lot of hard drives. The hard drives are scattered everywhere. Piles of hard drives can be seen in the background. Two police officers with pistols are busting through the door. Both of the police officers are wearing uniforms with text "HuggingFace" and a big hugging face emoji
That's awesome, you're good at prompting diffusion models.
never rolling out to the generation-space where some of these live
What is the "generation space"?
I'm actually really happy with the hyper-parameters and extra code used to train these now, and it will likely take me 1-2 days to write the README.MD file.
That's going to be a goldmine of information!
creative-writer-v1.0-32b
I'm hoping this one is good given we get GQA and it doesn't write random Russian/Chinese letters when quantized :)
never rolling out to the generation-space where some of these live
What is the "generation space"?
If the phrase "shivers down her spine" isn't in the training corpus, or even in the training corpus but way less than the model will use if left to write its own text, then it will never get the feedback needed to stop it outputting this... The only way it can get some feedback is via seeing lots of other things that are not "shivers down her spine" and hoping that these drown it out.
This is opposed to the case where you allow it to roll out further in its own "generation space" and use something like DPO to give feedback on its own generations.
The creative-writer-v1.0-35b is really interesting! It perhaps has picked up on some weird formatting from the paragraphs training data (eg: starting paragraphs indented by a space or using triple newlines), but it only does this if you just start off with a blank slate, and not always. If you start it off with a bit of the title or edit the opening phrase it seems better and I think it will be fine if you stop and correct the first couple of paragraphs if needed...
BUT: it seems to be very creative and much more interesting than the creative-writer-v1.0-32b fine-tune; using more internal monologues, etc.
I'm really excited now to see what creative-writer-v1.0-104b turns out like as the 35b model always was a bit flaky with the formatting before (as @gghfez found with his BSG story prompt).
Now you're just teasing...
Now you're just teasing...
It's uploading now (I already have the 32b uploaded but set private currently).
I've also just figured out what probably screwed the mistral-large training run! The default for Deepspeed when you have 3 machines with 2 pipeline stages is to distribute like this:
Using topology: {ProcessCoord(pipe=0, data=0): 0, ProcessCoord(pipe=0, data=1): 1, ProcessCoord(pipe=0, data=2): 2, ProcessCoord(pipe=1, data=0): 3, ProcessCoord(pipe=1, data=1): 4, ProcessCoord(pipe=1, data=2): 5}
Which probably makes sense for GPU clusters doing full fine-tuning:
class PipeDataParallelTopology(ProcessTopology):
    """ A topology specialization for hybrid data and pipeline parallelism.

        Uses data parallelism on the last dimension to encourage gradient
        reductions to use high-bandwidth intra-node links and lower-volume
        pipeline communications to use low-bandwidth inter-node links.
    """

    def __init__(self, num_pp, num_dp):
        super().__init__(axes=['pipe', 'data'], dims=[num_pp, num_dp])
but when you are training LoRAs it is a disaster... Each machine has to use the network to pass on a 12288-sized vector of floats for each token, but then every 10 minutes or so passes around ~10 x 12288-sized vectors of floats to do the reduce step for the gradients (using the NVLink or PCI-bus).
This is completely backwards for what we want, and I've fixed it now:
class CustomPipeDataParallelTopology(ProcessTopology):
    """A topology specialization for hybrid data and pipeline parallelism with swapped axes."""

    def __init__(self, num_pp, num_dp):
        # Swap the axes and dims to change the rank mapping
        super().__init__(axes=['data', 'pipe'], dims=[num_dp, num_pp])
Which gives:
Using topology: {ProcessCoord(data=0, pipe=0): 0, ProcessCoord(data=0, pipe=1): 1, ProcessCoord(data=1, pipe=0): 2, ProcessCoord(data=1, pipe=1): 3, ProcessCoord(data=2, pipe=0): 4, ProcessCoord(data=2, pipe=1): 5}
So now I have 3 copies of the model (the data axis above) spread over the 3 machines, and each machine has the model split between their 2 GPUs (the pipe axis above).
Using a 10gbit connection the data use probably didn't matter that much (but it still seems dumb to pass several TB of data through the network and a few MB through the NVLink bridge...), but I think this may have caused some problem with saving the checkpoints due to having the LoRA spread weirdly through multiple machines like it was...
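The effect of swapping the axes can be checked with a few lines of plain Python. This is only a sketch that reproduces the two logged mappings by enumerating coordinates with the last axis varying fastest; `rank_mapping` and `machines_per_replica` are hypothetical helpers (not Deepspeed API), and it assumes global ranks are assigned to machines contiguously, 2 GPUs per machine:

```python
from itertools import product

def rank_mapping(axes, dims):
    """Assign global ranks to coordinates in row-major order (last axis fastest)."""
    coords = product(*(range(d) for d in dims))
    return {tuple(zip(axes, c)): rank for rank, c in enumerate(coords)}

def machines_per_replica(mapping, gpus_per_machine):
    """For each data-parallel replica, list the machines its pipeline stages sit on."""
    replicas = {}
    for coord, rank in mapping.items():
        replica = dict(coord)["data"]
        replicas.setdefault(replica, set()).add(rank // gpus_per_machine)
    return {k: sorted(v) for k, v in sorted(replicas.items())}

default = rank_mapping(["pipe", "data"], [2, 3])
swapped = rank_mapping(["data", "pipe"], [3, 2])

default_layout = machines_per_replica(default, gpus_per_machine=2)
swapped_layout = machines_per_replica(swapped, gpus_per_machine=2)
```

Under those assumptions the default layout gives `{0: [0, 1], 1: [0, 2], 2: [1, 2]}`, so every replica's pipeline crosses the network, while the swapped layout gives `{0: [0], 1: [1], 2: [2]}`, so each replica stays inside one machine and only the gradient reduction goes over the network.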
Looks to be working and possibly quite a bit faster too:
GPU-SERVER-1: [2024-12-28 17:58:05.312] [INFO] [qlora-pipe] step: 1 / 562 loss: 2.8461 iter time (s): 533.796 samples/sec: 0.056 eta: 83h10m
(IIRC, the broken mistral-large run was 200 hours for 185M tokens through 123B parameters, and this is 83 hours for 135M tokens through 104B parameters)
So should have the results in around 4.5 days from now!
Any idea when you'll be releasing the 32b to the public? I'm curious to know if your dataset has any Sci-Fi in it?
Yeah, this looks to be around 50% faster! The old topology was probably even worse for LoRAs than it seemed as each of the 3 copies were probably all contending for the network at the same time to try to pass all their outputs between all the different possible combinations of stages :/
I think the next mistral-large training run will only take around 5 to 5.5 days to complete (instead of 9-10 days).
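As a rough sanity check of the "around 50% faster" figure, the two runs quoted above can be compared using tokens x parameters per hour as a crude proxy for work done. This is just back-of-envelope arithmetic, and the proxy itself is an assumption (it ignores sequence length, quantisation, and so on):

```python
def work_per_hour(tokens, params, hours):
    """Crude throughput proxy: token-parameter products processed per hour."""
    return tokens * params / hours

old = work_per_hour(185e6, 123e9, 200)  # earlier mistral-large run (old topology)
new = work_per_hour(135e6, 104e9, 83)   # current 104b run (swapped topology)
speedup = new / old                      # roughly 1.49

# The same mistral-large job at the new rate, converted to days.
projected_days = 200 / speedup / 24      # roughly 5.6
```

So the proxy gives a speed-up of about 1.49x and a projected mistral-large run of about 5.6 days, which is in the same ballpark as the estimate above.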
Any idea when you'll be releasing the 32b to the public?
I'll open them both up with a blank readme page tomorrow - but hold off on any judgments until we see what happens for the big models, as from past testing the smaller models are much more susceptible to being damaged by this process!
I'm curious to know, does your dataset have any Sci-Fi in it?
Yeah, there are probably around 10-15% Sci-Fi books in the training data, but I wouldn't expect it to be all that important:
- This isn't really a "fine-tune" that compares to what other people upload (ie: that can specifically "learn" things from the dataset used to train it - it's just a jumbled up bunch of paragraphs!).
- Think of it more as a "recalibration" to try to get the writing style back to what a more "normal" pre-LLM authors' style would be.
Hopefully it will be clearer what I mean when we get the first big model finished :)
Sounds good! It'll be fun for me to have a mess about with it! Thanks man.
I think the bigger the drop in loss, the more the risk of damage:
So we should (hopefully) see the magenta line drop much less.
When I tried to run this on the really small models; I got huge drops and the models were pretty much lobotomised as a result :/
This paper explains this phenomenon a bit too:
https://arxiv.org/abs/2405.09673
I'm already very heavily regularising the model though:
far (far, far!) more than 99.9% of fine-tunes here and this can be reduced even more if needed (at the cost of the fine-tuning process having less and less effect...).
It's unlocked now:
https://huggingface.co/jukofyork/creative-writer-32b-preview
Looks like they are enforcing the limits now:
403 Forbidden: Private repository storage limit reached, please upgrade your plan to increase your private storage limit.
So had no choice but to unlock it early to upload the 35b model :/
Nice!
I just got temporary access to DeepSeek v3 FP8 and I was trying to find the prompts that were being used for our test runs? Since the original thread is toast I don't see them.
So had no choice but to unlock it early to upload the 35b model :/
They must be rolling that out in stages??
So I'm guessing you don't want quants uploaded?
I guess I'll be setting public a lot of broken model experiments soon (eg. a WizardLM2-8x22b hack which only responds in Spanish lol)
I just got temporary access to DeepSeek v3 FP8 and I was trying to find the prompts that were being used for our test runs? Since the original thread is toast I don't see them.
They were in the old version's thread of doom. I've got some of them in my history from testing:
https://pastebin.com/pdrqEB1M
Password: NoSlop4U
There's also this:
https://huggingface.co/jukofyork/creative-writing-control-vectors-v3.0/discussions/1
They were in the old version's thread of doom. I've got some of them in my history from testing:
Thanks! Were the outputs Temp 0 , Temp 1 or something else?
I've only got that info for one of them: Temperature=0, no system prompt, and it looks like it was using the command-r chat template.
So I'm guessing you don't want quants uploaded?
Feel free to do whatever with them - it's because I'm uploading each file 1 at a time that I would rather not open the repo until it's done or else people might miss some of the files :/
It's good (testing unquantized) side-by-side with the official cohere model. 10 prompts so far.
Cohere model has given me Elara in 6 of them. Yours hasn't given me any of the slop I usually notice.
Even when it wants to write some of the slop-like phrases such as "her hand hovering over the hilt of her dagger" (slop) it writes it differently, like starting a new sentence and writing "She clutched her dagger".
It sometimes gives me these humorous disclaimers too (I get that this probably wasn't intentional but it's funny):
"""
Author's Note: All prose, names, and terminology are entirely fictional and any coincidental similarities to real-world counterparts are unintended. Absolutely no elves or bards were harmed in the writing of this mock-up novel chapter, although the author accepts no liability for those listening to this chapter who happen to be reading a bedtime story to elves or bards at the same time.
"""
the author accepts no liability for those listening to this chapter who happen to be reading a bedtime story to elves or bards at the same time.
LOL!
It does lose narrative cohesion sometimes though, for example, in 3 stories it changed from first -> third person perspective. And it seems to have a "darkness" kind of bias towards it, almost like when I apply a control-vector.
As it is now, this is already a very good model despite that ^. Normally you'd need a difficult to run model like Mistral-Large or the non-GQA command-r/r+ to get output like this. Huge improvement over the standard 32b version. I'm going to mlx-quant it so I can run it on my mac while I'm away in January.
Thank you for this!
The 35b is uploaded and open now:
https://huggingface.co/jukofyork/creative-writer-35b-preview
This one seems to have slightly weird/broken formatting if you just try to start stories from a "blank slate", but it seems very creative when it does this (possibly due to using different tokenisation via a leading space?).
And it seems to have a "darkness" kind of bias towards it, almost like when I apply a control-vector.
Yeah, I wasn't sure if I was imagining this but I thought it seemed pretty dark too? I think the 35b might even be more dark from my brief testing!?
changed from first -> third person perspective
This could be bias in the dataset of books being used.
Author's Note: All prose, names, and terminology are entirely fictional and any coincidental similarities to real-world counterparts are unintended. Absolutely no elves or bards were harmed in the writing of this mock-up novel chapter, although the author accepts no liability for those listening to this chapter who happen to be reading a bedtime story to elves or bards at the same time.
LOL, no idea at all where this could have come from? A rank-16 LoRA couldn't even have encoded that, so it must have been some latent weirdness it's brought to the surface? :O
Thank you for this!
No problem and hopefully the 104b and mistral-large:123b will work even better!
There is probably a lot more I can try with the dataset formatting, but really need to see how it works on the larger models before making any more changes to the method.
I'm pretty sure the method of making the gradients zero for the special tokens works well too, and is likely the reason that other guy found that you can train LoRAs on the base model to be applied to the instruct model perhaps?
"""
Author's Note: All prose, names, and terminology are entirely fictional and any coincidental similarities to real-world counterparts are unintended. Absolutely no elves or bards were harmed in the writing of this mock-up novel chapter, although the author accepts no liability for those listening to this chapter who happen to be reading a bedtime story to elves or bards at the same time.
"""
the author accepts no liability for those listening to this chapter who happen to be reading a bedtime story to elves or bards at the same time.
LOL that's awesome...
I should reiterate that I wouldn't be too disappointed if these first versions are a little broken - I have full control over the regularisation now and can regularise over the full continuum of possible models if needed, and there are several ways multiple training runs can be combined to smooth out excessive noise if needed too.
and is likely the reason that other guy found that you can train LoRAs on the base model to be applied to the instruct model perhaps?
I asked him why on discord a while back, he said he didn't know and was just experimenting :D
Yeah, I wasn't sure if I was imagining this but I thought it seemed pretty dark too?
Definitely is. It wrote a pretty brutal story when I prompted it to write about 2 giraffes fighting over the last acacia tree during a drought lol.
If not for the cohere license regarding training on outputs, I'd use this in place of what I did for the dataset for my Mistral-Large-2411 writing model (which was to use control-vectors on Apache2 licensed models to generate the synthetic aspects of the dataset)
Quants:
https://huggingface.co/gghfez/jukofyork_creative-writer-32b-preview-exl2-4.5bpw
https://huggingface.co/gghfez/jukofyork_creative-writer-32b-preview-AWQ
https://huggingface.co/bartowski/creative-writer-32b-preview-GGUF
https://huggingface.co/gghfez/jukofyork_creative-writer-35b-preview-AWQ
https://huggingface.co/gghfez/jukofyork_creative-writer-35b-preview-exl2-4.5bpw
lol at these disclaimers :D
I wonder what happened to command-r
This is full precision, with min_p set to 0.2. Same thing happens with the cohere model (full weights, AWQ, exl2 and GGUF) and on openrouter.
I don't remember it having this problem when they released it. I wonder if an update to transformers caused a regression at some point.
lol at these disclaimers :D
These are a weird one - I can't really explain what could have caused two different models to start doing that?!
This is full precision, with min_p set to 0.2. Same thing happens with the cohere model (full weights, AWQ, exl2 and GGUF) and on openrouter.
Yeah, I don't remember it doing that either...
I did notice that the tied input_embedding and lm_head tensor looks like it might have been scaled (along with all the other models' tensors) to use most of the range of float16. This means that if you scale the logits just a tiny bit; some of the losses go to exactly zero which they don't if you leave it alone?
changed from first -> third person perspective
I've been thinking about this some more and worry this might end up being a problem for all models due to the "random paragraphs" dataset mixing the different perspectives.
There are a couple of possible solutions:
- Use blocks of 2+ consecutive paragraphs taken either randomly, or in order, from the same source book.
- Try to use <BOS>paragraph1\n\n<EOS><BOS>paragraph2\n\n<EOS><BOS>paragraph3\n\n<EOS>... type formatting (I'm reluctant to do this though as the double newline before <EOS> is likely very out-of-distribution).
- Try to use another model (LLM or otherwise) to classify the entire books' perspective and then split the training data so that the same perspectives get put in the same file of random paragraphs.
I will wait until we see what comes from training on the much larger command-r-plus:104b and mistral-large-2:123b first as it may not be such a problem for these.
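The first option above (blocks of consecutive paragraphs from the same book) could look something like the sketch below. This is only an illustration, not the actual dataset script: `sample_blocks` and its parameters are hypothetical names, and it assumes each book is already available as a list of cleaned paragraphs.

```python
import random
from typing import List, Sequence


def sample_blocks(paragraphs: Sequence[str], block_size: int,
                  rng: random.Random) -> List[List[str]]:
    """Cut one book into non-overlapping runs of consecutive paragraphs,
    then shuffle the runs (hypothetical helper, not the original script).

    Paragraphs inside a run stay in their original order, so the point of
    view is far less likely to flip in the middle of a training sample.
    Any leftover paragraphs that do not fill a whole run are dropped.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    last_start = len(paragraphs) - block_size
    blocks = [list(paragraphs[i:i + block_size])
              for i in range(0, last_start + 1, block_size)]
    rng.shuffle(blocks)
    return blocks


if __name__ == "__main__":
    book = [f"paragraph {i}" for i in range(7)]
    for block in sample_blocks(book, block_size=2, rng=random.Random(0)):
        print(block)
```

With block_size=1 this reduces to the existing "random paragraphs" shuffle, so the block size acts as a dial between fully shuffled and fully in-order text.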
Yeah, I don't remember it doing that either...
I did notice that the tied input_embedding and lm_head tensor looks like it might have been scaled (along with all the other models' tensors) to use most of the range of float16. This means that if you scale the logits just a tiny bit; some of the losses go to exactly zero which they don't if you leave it alone?
I've actually just realised I can now train these small models without quantizing them!
I tried this before and found that running 6 x 4bit models was around 2x faster than running 3 x 16bit models split between the 2 GPUs on each machine... BUT: that was because instead of having the two halves of the model split on the same machine, and passing their hidden state using the NVLink bridge; I was actually passing it all through the network!
With the fix I mentioned above, running the models in their native float16 (and using float32 for the optimiser states as before) is now exactly the same training time to run as when quantised to 4bits...
I'm gonna rerun the 32b and 35b models using this to see if that fixes some of the weirdness... I can already see that the float16 model has a slightly lower starting loss and starting Entropy, and a slightly higher top-1 accuracy, so this could help quite a lot.
Yeah, there is no weird blip on the log-loss histogram now, so I think float16 will likely fix some of the strange stuff it was doing...
I can probably even train command-r-plus like this: by using 3 pipeline stages with 1 of the stages having to use the network (but it might need an InfiniBand connection if 10gbit is too slow).
Nope, it will need >48GB per card for this and I think 6 stages will have too much overhead :/
Turns out I had some sort of corruption in the cached tokenised training data. Now I've deleted my ~/.cache/huggingface/* folder contents, I'm getting a much lower loss before I even start:
and that was probably why the 35b acted so weird (it is the blue line above).
I didn't even know that folder existed and only found it by chance trying to debug something unrelated :/
fyi - not complaining about the disclaimers, it's fun. I haven't used a command-r-35b in general for a while since the garbage characters bug (got mistral-large since then anyway).
Nope, it will need >48GB per card for this and I think 6 stages will have too much overhead :/
48GB A40 instances are $0.39 / hr on runpod.io if that helps. Could be useful for your low upload bandwidth issue as well.
EDIT: just realised, you said >48GB (and I recall you already have 48GB GPUs)
the float16 model has a slightly lower starting loss and starting Entropy
Interesting. I would have expected the opposite. So this is different from testing token probabilities during inference then?
When I've tested this, I found that the stronger the quantization, the flatter the distribution of tokens.
Abliterated models did this as well, even at bf16, they behaved more like Q4_K versions of the "non-abliterated" originals.
p.s. I noticed something interesting regarding control-vectors (generally): the occupation of random side characters is often "author" or "writer" lol
Hi @jukofyork first impressions of your Creative Writer Preview have been very positive so far. Thank you and Happy new year!
I think the next version will be better, as I'm pretty sure some tokenisation bug affected the last run:
I will be having a break over the next few days so gonna just try this training process on the 32b and 35b models:
- Train using the "random paragraphs" method (~140M tokens).
- Train using the same paragraphs, but put them all back in order from the books they were extracted from with <EOS> tokens separating the books (same ~140M tokens; 1073 books).
- Train a 3rd time on the "in order" books, but use Focal Loss* with gamma=1.1 like I did for the "bravo" experimental models.
Stage 2 will use the model with the stage 1 LoRA merged, and then stage 3 will use the model with the stage 2 LoRA merged, and so on.
My hope is that the first stage will force the model to use more natural language as it can't reduce the loss by looking at recent paragraphs, then the second stage will hopefully fix the mixing of the 1st/3rd person POV (and any other general "weirdness") caused by stage 1's "myopia", and then finally stage 3 will try to ramp up the single-token Entropy.
Assuming all goes well, I'll upload each stage of each model as I go along (each run takes 24-25 hours, so will take around a week to do this).
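For anyone unfamiliar with the stage 3 objective, here is a minimal sketch of the standard focal loss for a single target-token probability, compared with plain cross-entropy. It is an assumption about the general form only: the asterisk on "Focal Loss*" suggests the variant used in these runs may differ, and the function names are made up for the example.

```python
import math


def cross_entropy(p_target: float) -> float:
    """Ordinary negative log-likelihood of the correct token."""
    return -math.log(p_target)


def focal_loss(p_target: float, gamma: float = 1.1) -> float:
    """Standard focal loss: cross-entropy scaled by (1 - p)^gamma.

    Tokens the model already predicts confidently (p close to 1) are
    down-weighted, so there is less pressure to keep sharpening them.
    gamma=0 recovers plain cross-entropy.
    """
    return (1.0 - p_target) ** gamma * -math.log(p_target)


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9, 0.99):
        ratio = focal_loss(p) / cross_entropy(p)
        print(f"p={p:<5} CE={cross_entropy(p):.4f} "
              f"focal={focal_loss(p):.4f} weight={ratio:.4f}")
```

The printed weight column shows how quickly the penalty fades for high-confidence tokens, which is the mechanism that leaves room for higher single-token Entropy.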
Has anybody else noticed Claude Sonnet has had a lobotomy recently? I thought I was imagining it, but maybe not:
https://www.reddit.com/r/ClaudeAI/comments/1hqi57a/whats_going_on_with_sonnet_35_the_past_few_days/
This seems to be a common practice now: release a good model publicly, then slowly quantise or reroute it to save bleeding VC cash over the next few months and hope nobody notices :/
I've actually found gpt-4-0125-preview seems to be the least nerfed (maybe they forgot it? lol).
Nearly done with 35b now:
I've worked out what went wrong too:
https://huggingface.co/spaces/Xenova/the-tokenizer-playground
If you tokenise multiple paragraphs you get two separate newline tokens (206) between them:
This is a paragraph.
This is another paragraph.
[5, 4184, 1801, 1671, 42429, 21, 206, 206, 4184, 1801, 4907, 42429, 21]
but if you tokenise each on their own you get the double newline token (2126):
This is a paragraph.
[5, 4184, 1801, 1671, 42429, 21, 2126]
and then when you concatenate these you get wildly out-of-distribution data!
I don't really know enough about tokenisers, but this was not what I expected and seems really odd behaviour to me?!
(It also explains why I mysteriously gained 1M+ tokens for my new run - I was super confused where they had come from! 🤣)
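A cheap guard against this kind of bug is to compare "tokenise each piece, then concatenate" against "concatenate, then tokenise" before building the dataset. The sketch below uses a toy stand-in tokeniser (`toy_encode`, hypothetical) that only hard-codes the behaviour reported above, with the two newline IDs taken from the post; with a real tokeniser you would pass its own encode function instead.

```python
from typing import Callable, List, Sequence

NEWLINE_ID = 206          # single newline token ID quoted in the post
DOUBLE_NEWLINE_ID = 2126  # double newline token ID quoted in the post


def toy_encode(text: str) -> List[int]:
    """Toy tokeniser (an assumption, NOT the real one): a trailing double
    newline becomes one token, while newlines followed by more text are
    emitted one at a time. Other characters get fake IDs."""
    trailing = text.endswith("\n\n")
    body = text[:-2] if trailing else text
    ids = [NEWLINE_ID if ch == "\n" else 10_000 + ord(ch) for ch in body]
    if trailing:
        ids.append(DOUBLE_NEWLINE_ID)
    return ids


def concat_matches_joint(pieces: Sequence[str],
                         encode: Callable[[str], List[int]]) -> bool:
    """True if tokenising each piece separately gives the same IDs as
    tokenising the joined text in one go."""
    joint = encode("".join(pieces))
    separate = [tok for piece in pieces for tok in encode(piece)]
    return joint == separate


if __name__ == "__main__":
    paragraphs = ["This is a paragraph.\n\n", "This is another paragraph."]
    print(concat_matches_joint(paragraphs, toy_encode))  # mismatch expected
```

In the toy case the separately tokenised version is also one token shorter per paragraph boundary, which fits the extra tokens showing up once the joined text was tokenised properly.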
Anyway, I'm still gonna run the second stage on each of these as I think the switching to/from 1pp/3pp will still be a problem.
I've also got qlora-pipe to output metrics about the hidden state going into lm_head which should mean for stage 3 that I can push the Entropy as high as I possibly can before the model breaks down (ie: where it starts "cheating" the Focal Loss* by shrinking the hidden state).
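To see why shrinking the hidden state counts as "cheating", here is a small sketch. It assumes the output head is a plain linear map with no bias, so scaling the hidden state by some factor scales every logit by the same factor; the logit values are made up for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_at_scale(logits: List[float], scale: float) -> float:
    """Entropy of the next-token distribution after scaling all logits,
    which is what a uniformly shrunk hidden state does under the
    no-bias linear head assumption."""
    return entropy(softmax([scale * x for x in logits]))


if __name__ == "__main__":
    logits = [4.0, 2.0, 1.0, 0.5]  # made-up example values
    for scale in (1.0, 0.75, 0.5, 0.25):
        print(f"scale={scale:<5} entropy={entropy_at_scale(logits, scale):.4f}")
```

Entropy rises as the scale falls even though the ranking of tokens never changes, so the model can raise Entropy without learning anything new about the text. Tracking the hidden-state norm alongside the loss is one way to spot that.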
Has anybody else noticed Claude Sonnet has had a lobotomy recently? I thought I was imagining it, but maybe not:
https://www.reddit.com/r/ClaudeAI/comments/1hqi57a/whats_going_on_with_sonnet_35_the_past_few_days/
This seems to be a common practice now: release a good model publicly, then slowly quantise or reroute it to save bleeding VC cash over the next few months and hope nobody notices :/
I've actually found gpt-4-0125-preview seems to be the least nerfed (maybe they forgot it? lol).
I've tried doing some non-standard functionality coding and what surprised me is how bad ALL LLMs are at it. I need to guide them each little step or they'll fuck up. O1? Decides to randomly change unrelated parts of the code. Gemini? Just dumb. Sonnet? A bit better, but still makes beginner-level mistakes that need to be fixed by hand. At this point I'm feeling like it would have been faster if I coded it just by myself.
Yeah, I think even o1-preview has been quietly nerfed :/
Talk of the different tokenization of single and multiple paragraphs reminds me of something. I don't know how useful this is, but I remember when I was messing around implementing the novelai api in a writing program, they were doing something unusual with their llama finetune - you have to strip the last token off the context when sending text in. Apparently llama has many ways to tokenize the end of the sentence/paragraph and it was causing issues, whereas stripping it let the model continue in a more creative way. Probably not a useful thought but I thought I'd write it down here since this seemed like the most appropriate place ;p. I've actually taken this to an extreme in the past, having entire paragraphs stripped from the end of every generation because I found continuations were better if I started a paragraph back rather than at the end of the previous gen. It's possible I was "solving" the same issue by accident, explaining the improved text quality across the story.
Or, maybe I'm just a fool sitting here fooling around with magical talk-boxes imagining things :).
Anyway, hey. Somehow control vectors slipped below my radar. Look forward to jumping in and messing with them. What's a typical workflow look like with these things? Are you constantly loading and unloading vectors to steer the story? I'm digging around looking for examples/more info but haven't found much. Wouldn't mind implementing this in my own writing system (I'm a novelist by trade and always looking for ways to get the AI writing in a more controllable way).
Odd thought:
Awhile back I was testing out trying to control positivity/negativity and bias in the model by doing external dice rolls (python) and feeding the roll to the LLM in context every turn, asking it to act accordingly based on its dice roll and giving it a prompt to continue the novel-style narrative. I wasn't roleplaying in the classic sense. The idea was that LLMs likely have enough information about roleplay/D&D style interactions that if I took an ongoing novel and gave it a dice roll, then said it needed to continue the story based upon the success (or failure), of that roll, it would allow me to steer a story a bit more directly and achieve negative and positive outcomes where I wanted them.
It worked, albeit a bit noisy. Low rolls led to negative storytelling, higher dice rolls led to positive things happening.
Now I'm imagining a situation where control vectors are made for rolls 1-20 by producing prompts and showing the varied outcomes (the continued novel-text based on the results of that roll).
Once produced, you apply vectors each generation based on how the roll goes in the background (so if they rolled an 18, you'd apply the 'roll18' vector). The text coming out is then being steered based on those outcomes. It should give you good 2-way vectors since the rolls largely pivot around the 10 (especially if you're prompting it to figure 10 as neutral and 9 and below as negative, 11 and up as positive). Would also make implementing a slider in a UI easy to push the story in positive or negative directions by sliding it up or down... and since the roll outcomes are creative/ambiguous, it should give the AI some space to be creative in how it interprets the scene.
Anyway, I'm not in my depth here yet - I'll have to mess around with control vectors and get a feel for them.
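The in-context version of the dice-roll idea can be sketched roughly as below. The thresholds follow the description above (10 neutral, 9 and below negative, 11 and up positive); the prompt wording and function names are hypothetical, and nothing here touches control vectors.

```python
import random
from typing import Tuple


def classify_roll(roll: int) -> str:
    """Map a d20 roll to an outcome label, pivoting around 10."""
    if not 1 <= roll <= 20:
        raise ValueError("roll must be between 1 and 20")
    if roll <= 9:
        return "negative"
    if roll == 10:
        return "neutral"
    return "positive"


def build_prompt(story_so_far: str, roll: int) -> str:
    """Append the roll to the context (the wording is an assumption)."""
    outcome = classify_roll(roll)
    return (
        f"{story_so_far}\n\n"
        f"[Dice roll: {roll}/20 ({outcome}). "
        "Continue the narrative so that events reflect this outcome.]"
    )


def next_turn_prompt(story_so_far: str, rng: random.Random) -> Tuple[int, str]:
    """Roll in the background and return the roll plus the prompt to send."""
    roll = rng.randint(1, 20)
    return roll, build_prompt(story_so_far, roll)


if __name__ == "__main__":
    roll, prompt = next_turn_prompt("The gate creaked open.", random.Random())
    print(roll)
    print(prompt)
```

A UI slider could simply replace the random roll with a fixed value passed to build_prompt.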
Talk of the different tokenization of single and multiple paragraphs reminds me of something. I don't know how useful this is, but I remember when I was messing around implementing the novelai api in a writing program, they were doing something unusual with their llama finetune - you have to strip the last token off the context when sending text in. Apparently llama has many ways to tokenize the end of the sentence/paragraph and it was causing issues, whereas stripping it let the model continue in a more creative way. Probably not a useful thought but I thought I'd write it down here since this seemed like the most appropriate place ;p. I've actually taken this to an extreme in the past, having entire paragraphs stripped from the end of every generation because I found continuations were better if I started a paragraph back rather than at the end of the previous gen. It's possible I was "solving" the same issue by accident, explaining the improved text quality across the story.
Thanks - that's really interesting! It definitely seemed to make the 32b more creative but also seemed to completely break the 35b, which started every paragraph with a word tokenised with the "space before" variant.
I think your thing about stripping the last paragraph to continue from is interesting too - it probably gets the model away from some "finishing response soon" internal representation?
Anyway, hey. Somehow control vectors slipped below my radar. Look forward to jumping in and messing with them. What's a typical workflow look like with these things? Are you constantly loading and unloading vectors to steer the story? I'm digging around looking for examples/more info but haven't found much. Wouldn't mind implementing this in my own writing system (I'm a novelist by trade and always looking for ways to get the AI writing in a more controllable way).
I tend to just set them up and leave them on. If you turn them off after generating part of the story then the model will tend to quickly revert to its "default" style.
It may be worth switching them up for different POV characters or as I have found (sometimes hilariously) you can end up writing for example a "Grimdark" story where everyone is a bunch of stone-cold sociopaths, who slowly get worse and worse over the chapters! :D
Odd thought:
Awhile back I was testing out trying to control positivity/negativity and bias in the model by doing external dice rolls (python) and feeding the roll to the LLM in context every turn, asking it to act accordingly based on its dice roll and giving it a prompt to continue the novel-style narrative. I wasn't roleplaying in the classic sense. The idea was that LLMs likely have enough information about roleplay/D&D style interactions that if I took an ongoing novel and gave it a dice roll, then said it needed to continue the story based upon the success (or failure), of that roll, it would allow me to steer a story a bit more directly and achieve negative and positive outcomes where I wanted them.
It worked, albeit a bit noisy. Low rolls led to negative storytelling, higher dice rolls led to positive things happening.
Now I'm imagining a situation where control vectors are made for rolls 1-20 by producing prompts and showing the varied outcomes (the continued novel-text based on the results of that roll).
Once produced, you apply vectors each generation based on how the roll goes in the background (so if they rolled an 18, you'd apply the 'roll18' vector). The text coming out is then being steered based on those outcomes. It should give you good 2-way vectors since the rolls largely pivot around the 10 (especially if you're prompting it to figure 10 as neutral and 9 and below as negative, 11 and up as positive). Would also make implementing a slider in a UI easy to push the story in positive or negative directions by sliding it up or down... and since the roll outcomes are creative/ambiguous, it should give the AI some space to be creative in how it interprets the scene.
Anyway, I'm not in my depth here yet - I'll have to mess around with control vectors and get a feel for them.
One of the first things I tried was to implement the 2-axis alignment system from AD&D, but quickly found that you can't really have mixed concepts in a control vector - it has to be two clearly defined sides of a single axis (and the law-chaos axis was too mixed to work properly).
So for example your idea would really need to be trained on "18" vs "not 18" and load this each time. If you tried to mix "greater than 10" and "10 or less" to train from you'd just end up getting lots of noise added due to the mathematics of the way control vectors are created sadly.
To create control vectors you have to really find a way to clearly demonstrate two sides of a clearly defined (and extremely obvious to the model) axis to start with, train on this and then decide afterwards the scale factor you want to use to elicit the effect you want.
- Train using the "random paragraphs" method (~140M tokens).
- Train using the same paragraphs, but put them all back in order from the books they were extracted from with <EOS> tokens separating the books (same ~140M tokens; 1073 books).
- Train a 3rd time on the "in order" books, but use Focal Loss* with gamma=1.1 like I did for the "bravo" experimental models.
Stage 2 will use the model with the stage 1 LoRA merged, and then stage 3 will use the model with the stage 2 LoRA merged, and so on.
My hope is that the first stage will force the model to use more natural language as it can't reduce the loss by looking at recent paragraphs, then the second stage will hopefully fix the mixing of the 1st/3rd person POV (and any other general "weirdness") caused by stage 1's "myopia", and then finally stage 3 will try to ramp up the single-token Entropy.
I've found you can't run stage 2 using the same dataset or it just drops all it learned from stage 1 in the first few steps (altering down_proj only must be too near to being convex).
I've also found that running a second stage (still using the "random paragraphs" data from stage 1) using Focal Loss* with gamma=1.1 actually works really well though. It seems starting from the minimum found in the previous stage lets the training really focus in on just increasing the Entropy and is way less likely to overshoot because of the momentum in Adam:
I've also added some metrics in this PR to help see what's going on better:
I have actually thought of a completely new PEFT method over the holidays to be applied to the attention matrices specifically.
It's not "LoRA" (as in "low rank") but for 64 heads it uses exactly the same number of tunable parameters as for a rank-64 LoRA applied to the q_proj (and less for the k_proj / v_proj if using GQA).
It's a bit involved to explain and will need some custom code writing to use a block-diagonal matrix in a PEFT wrapper, but I think there is actually a fundamental flaw in using LoRAs with multi-headed attention that this should fix (to do with the cross-talk / lack of enforced sparsity in lora_B which is getting added to all the attention heads when actually there is no reason to believe there is any actual linkage between the heads!).
The most important thing with this idea is that it might actually be possible to regularise each of the tiny 128x128 matrices (back towards the identity matrix), that each act independently on a separate attention head, so as to use knowledge of relative sample sizes of the high vs low frequencies generated using RoPE to have way less chance of screwing up the long-context ability of models when trained on shorter sequences.
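The parameter-count claim is easy to check with a little arithmetic. The sketch below assumes a square q_proj whose width is 64 heads x 128 dims = 8192, which is what makes the two counts coincide; the 8 key/value heads in the GQA case are a hypothetical figure used only for illustration.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Tunable parameters in a LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def block_diagonal_params(n_heads: int, head_dim: int) -> int:
    """One independent (head_dim x head_dim) matrix per attention head."""
    return n_heads * head_dim * head_dim


if __name__ == "__main__":
    n_heads, head_dim = 64, 128
    hidden = n_heads * head_dim  # 8192, assumed hidden size

    print("rank-64 LoRA on q_proj:", lora_params(hidden, hidden, 64))
    print("block-diagonal q_proj :", block_diagonal_params(n_heads, head_dim))

    kv_heads = 8  # hypothetical GQA setting
    print("rank-64 LoRA on k_proj:", lora_params(hidden, kv_heads * head_dim, 64))
    print("block-diagonal k_proj :", block_diagonal_params(kv_heads, head_dim))
```

Both q_proj counts come out at 1,048,576, and the block-diagonal count shrinks in proportion to the number of key/value heads.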
Has anybody else noticed Claude Sonnet has had a lobotomy recently?
About 2 weeks ago I noticed it started:
- Insisting on breaking up code into different messages. (I got around this by saying “Please write it all in one message, I promise it will go through”) lol. Without the promise, it would still break it up.
- Saying “Wow, that’s a really clever idea” and other compliments.
- Making mistakes in its code, forgetting what we’re trying to do, and repeating its mistakes.
- Acting curious and asking me questions all the time.
This is all via OpenRouter / API.
If o1 got nerfed at the same time, well I find that to be too much of a coincidence. Maybe an OpenRouter issue?
What's a typical workflow look like with these things?
I usually use Exui for writing, sometimes tabbyAPI + Mikupad if I want to click the token probabilities and choose a different token / change the direction of the story.
Are you constantly loading and unloading vectors to steer the story?
The character ones (honesty, etc) I toggle frequently. The rest I leave on.
When I was using llama.cpp, I wrote a wrapper UI which looked similar to the “command line generator” in the control-vectors GitHub repo.
I guess I do frequently adjust them.
Has anybody else noticed Claude Sonnet has had a lobotomy recently?
About 2 weeks ago I noticed it started:
- Insisting on breaking up code into different messages. (I got around this by saying “Please write it all in one message, I promise it will go through”) lol. Without the promise, it would still break it up.
- Saying “Wow, that’s a really clever idea” and other compliments.
- Making mistakes in its code, forgetting what we’re trying to do, and repeating its mistakes.
- Acting curious and asking me questions all the time.
This is all via OpenRouter / API.
If o1 got nerfed at the same time, well I find that to be too much of a coincidence. Maybe an OpenRouter issue?
I've been using claude-sonnet-3.5 on OpenRouter via the API and have tried all 4 variants (ie: old, new, self-moderated and "beta") and all are working like complete shit :/
I've actually been using o1-preview using the openai API and it definitely seems to have got quite a lot dumber, and seems to make a lot more stupid mistakes than it used to make :(
First version with the "double-newline" tokenisation bug fix is uploaded:
https://huggingface.co/jukofyork/creative-writer-32b-preview-01-2025
I'm currently uploading creative-writer-plus-32b-preview-01-2025 which has its Entropy quite significantly boosted.
I have creative-writer-plus-35b-preview-01-2025 training now, so will upload the two 35b models over the next couple of days too.
This looks really interesting:
https://github.com/zenoverflow/omnichain
There are quite a few interesting threads on Reddit about it, but this has the most details on how it might be interesting for writing:
I was actually just looking for something to quickly prototype lots of regex + loopback LLM manipulations to try to get some sort of workflow to tidy up books in text format, but I think it actually might have quite a lot of potential for mixing things up for creative writing too - especially as it can act as an OpenAI API endpoint itself...
All 4 versions of the new preview models uploaded:
https://huggingface.co/jukofyork/creative-writer-32b-preview-01-2025
https://huggingface.co/jukofyork/creative-writer-plus-32b-preview-01-2025
https://huggingface.co/jukofyork/creative-writer-35b-preview-01-2025
https://huggingface.co/jukofyork/creative-writer-plus-35b-preview-01-2025
The "plus" versions are actually pretty good now:
- You can now plot the mean and histogram of the final hidden state going into the lm_head in qlora-pipe's output (and it clearly isn't just downscaling them if you continue the training like this).
- It's clear that increasing the Entropy using Focal Loss* works way better if you start from an already converged model and use the same dataset for it.
- It might actually be possible to run this in 3-4 stages and eventually push the Entropy right back up to the level of a base model whilst hardly doing any damage!
These were all trained on the old dataset, but I have now refined the dataset a little:
- I've got a couple of regexs to hopefully filter out even more junk (eg: discarding hyphenated paragraphs, checking the start and end characters, etc):
if (( length >= min_threshold && length <= max_threshold )); then
    if [[ $trimmed_paragraph =~ ^(\"|\'|\"\'|\`|\(|\*{1,2}[\"\']?|\<{2,3}[\"\']?)?[[:upper:]] ]]; then
        if [[ $trimmed_paragraph =~ (\.|\!|\?|\)|\'|\")$ ]]; then
            valid_paragraphs+=("$trimmed_paragraph")
        fi
    fi
fi
I've decided to skip the first 10 and last 10 valid paragraphs from each book to hopefully avoid the "authors notes" problem.
I've used some very rudimentary regexs to try to cluster 1pp/2pp and 3pp into separate files when creating the random paragraphs:
remove_dialogue() {
    perl -pe "s/\"([^\"]|\\.)*\"//gs" <<< "$1"
}
is_1pp_2pp() {
    grep -Eiq "\b(I|me|my|mine|myself|I'm|I've|I'd|I'll|I'd've|we|us|our|ours|ourselves|we're|we've|we'd|we'll|we'd've|you|your|yours|yourself|yourselves|you're|you've|you'd|you'll|you'd've)\b" <<< "$1"
}
is_3pp() {
    grep -Eiq "\b(he|him|his|himself|he's|he'd|he'll|he'd've|she|her|hers|herself|she's|she'd|she'll|she'd've|it|its|itself|they|them|their|theirs|themselves|they're|they've|they'd|they'll|they'd've|one|one's|oneself)\b" <<< "$1"
}
content_no_dialogue=$(remove_dialogue "$file_content")
if is_1pp_2pp "$content_no_dialogue"; then
    target_file="$output_file_1pp_2pp"
elif is_3pp "$content_no_dialogue"; then
    target_file="$output_file_3pp"
else
    if (( RANDOM % 2 )); then
        target_file="$output_file_1pp_2pp"
    else
        target_file="$output_file_3pp"
    fi
fi
The new filters discard around 5% more text data (so around 30% discard in total now): 1074 books (745MB) --> 1.7M paragraphs (520MB).
There is only so much you can do with regexs and really to progress I need to find a way to use another LLM to look at all the paragraphs and help me sort them out via some kind of automated workflow....
So about to set off the training for v1.0 on the (older) command-r-plus model now (using 16bit floats if I can get it to work using 6 pipeline stages).
but I have now refined the dataset a little
I assume you've applied this to your dataset and saved it as a new dataset to train on (rather than doing this on the fly).
There is only so much you can do with regexs and really to progress I need to find a way to use another LLM to look at all the paragraphs and help me sort them out via some kind of automated workflow....
The tricky part with this (that I've found), is that the model won't follow your instructions reliably by default. Especially if the content of the dataset record includes something which could be interpreted as instructions -- it'll actually treat the content as instructions itself! And if you use too large a model, then it takes a really long time to produce the outputs.
That said; when I had an LLM review each record in my datasets, I had the best luck with gemma-2-2b-it, and sometimes gemma-2-2b-it-abliterated.
Few-Shot prompting helped improve reliability (Pre-fill the history with a few examples of the model doing what you want it to do).
I also instructed specific tags in the output, so that I could test for them programmatically, and resend it upon failure. This worked well because in cases where it failed to follow the instruction, it would also fail to write .
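The "check for tags, resend on failure" loop can be sketched as follows. The tag name `<result>`, the retry limit and the `generate` callable are all hypothetical placeholders (the actual tags used are not shown above); `generate` stands in for whatever function sends the few-shot prompt to the model and returns its text.

```python
import re
from typing import Callable, Optional

# Hypothetical tag; swap in whatever tag the prompt asks the model to emit.
TAG_PATTERN = re.compile(r"<result>(.*?)</result>", re.DOTALL)


def extract_tagged(output: str) -> Optional[str]:
    """Return the text inside the first tag pair, or None if it is missing."""
    match = TAG_PATTERN.search(output)
    return match.group(1).strip() if match else None


def classify_with_retry(generate: Callable[[str], str], prompt: str,
                        max_attempts: int = 3) -> Optional[str]:
    """Call the model until it produces a tagged answer or attempts run out.

    A missing tag is treated as a sign that the model ignored the
    instructions (for example by obeying text inside the record).
    """
    for _ in range(max_attempts):
        answer = extract_tagged(generate(prompt))
        if answer is not None:
            return answer
    return None  # the caller can log or skip the record


if __name__ == "__main__":
    replies = iter(["Sure, here you go!", "<result>3pp</result>"])
    print(classify_with_retry(lambda _prompt: next(replies), "classify this"))
```

Returning None instead of raising keeps a long batch job running when a handful of records never produce a valid answer.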
Qwen2.5 might be better at this, I didn't test it (because this predated the Qwen2.5 release), but I've found it to be unreliable as part of my automated manga translation system. Weirdly, sao10k's roleplay tune of Qwen2.5 is the best I've found for translating manga and matching text size to fit the speech bubbles accurately.
So about to set off the training for v1.0 on the (older) command-r-plus model now (using 16bit floats if I can get it to work using 6 pipeline stages).
Good luck, hope it goes well! This would probably be the only finetune of cr+!
but I have now refined the dataset a little
I assume you've applied this to your dataset and saved it as a new dataset to train on (rather than doing this on the fly).
Yeah, it all gets done with bash scripts so fairly slow, but ultimately I end up with this to train on:
> ../convert_paragraphs_to_pov_dataset.sh fiction-paragraphs fiction-dataset-shuffled
All files have been processed and concatenated into the output directory 'fiction-dataset-shuffled'.
Generated output files:
total 522M
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part10.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part11.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part12.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part13.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part14.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part15.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part16.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part17.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part18.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part19.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part1.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part20.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part21.txt
-rw-r--r-- 1 juk juk 5.2M Jan 6 02:22 1pp_2pp_part22.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part2.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part3.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part4.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part5.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part6.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part7.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part8.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 1pp_2pp_part9.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part10.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part11.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part12.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part13.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part14.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part15.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part16.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part17.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part18.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part19.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part1.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part20.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part21.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part22.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part23.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part24.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part25.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part26.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part27.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part28.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part29.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part2.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part30.txt
-rw-r--r-- 1 juk juk 5.6M Jan 6 02:22 3pp_part31.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part3.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part4.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part5.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part6.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part7.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part8.txt
-rw-r--r-- 1 juk juk 11M Jan 6 02:22 3pp_part9.txt
The 10MB files are due to huggingface's dataset.map() code that's used by qlora-pipe doing god knows what if you give it a single 500MB file (it runs out of RAM on a machine with 0.5TB of RAM in it!).
I'm actually gonna create 3-4 of these same datasets but with the paragraphs all shuffled about each time so I can use them for the Entropy increasing stage(s), and hopefully make it less likely the model will lock on to the exact order of a single permutation...
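A rough Python sketch of the "shuffle, then split into ~10MB parts" idea, for anyone who would rather not do it in bash. This is not the convert_paragraphs_to_pov_dataset.sh script: the function name, the blank-line separator between paragraphs and the file naming are assumptions for illustration only.

```python
import random
from pathlib import Path


def write_shuffled_parts(paragraphs, out_dir, seed, prefix="part",
                         max_bytes=10 * 1024 * 1024):
    """Shuffle paragraphs with the given seed and write files of at most max_bytes."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    order = list(paragraphs)
    random.Random(seed).shuffle(order)  # a different seed gives a different permutation

    paths, chunk, size = [], [], 0

    def flush():
        nonlocal chunk, size
        if chunk:
            path = out / f"{prefix}{len(paths) + 1}.txt"
            # Assumed format: paragraphs separated by a blank line.
            path.write_text("\n\n".join(chunk) + "\n", encoding="utf-8")
            paths.append(path)
            chunk, size = [], 0

    for paragraph in order:
        n = len(paragraph.encode("utf-8")) + 2  # +2 for the separator
        if chunk and size + n > max_bytes:
            flush()
        chunk.append(paragraph)
        size += n
    flush()
    return paths
```

Calling it once per seed (for example seeds 1 to 4, each into its own directory) gives several copies of the same paragraphs in different orders.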
There is only so much you can do with regexes and really to progress I need to find a way to use another LLM to look at all the paragraphs and help me sort them out via some kind of automated workflow....
The tricky part with this (that I've found) is that the model won't follow your instructions reliably by default. Especially if the content of the dataset record includes something which could be interpreted as instructions -- it'll actually treat the content as instructions itself! And if you use too large a model, it takes a really long time to produce the outputs.
That said; when I had an LLM review each record in my datasets, I had the best luck with gemma-2-2b-it, and sometimes gemma-2-2b-it-abliterated.
Few-Shot prompting helped improve reliability (Pre-fill the history with a few examples of the model doing what you want it to do).
I also instructed specific tags in the output, so that I could test for them programmatically, and resend it upon failure. This worked well because in cases where it failed to follow the instruction, it would also fail to write .
Qwen2.5 might be better at this, I didn't test it (because this predated the Qwen2.5 release), but I've found it to be unreliable as part of my automated manga translation system. Weirdly, sao10k's roleplay tune of Qwen2.5 is the best I've found for translating manga and matching text size to fit the speech bubbles accurately.
Yeah, this is way in the future as I don't have time to do this now anyway: I think pruning 30% of the data isn't really a huge loss and it would be a lot of work to do this properly, and then I might as well just increase the number of books and prune again...
So about to set off the training for v1.0 on the (older) command-r-plus model now (using 16bit floats if I can get it to work using 6 pipeline stages).
Good luck, hope it goes well! This would probably be the only finetune of cr+!
It's actually not going to take any longer using 6 stages of bf16 than it did for 3 lots of 2 stages of 4bit:
GPU-SERVER-1: [2025-01-06 13:02:08,008] [INFO] [logging.py:129:log_dist] [Rank 0] step=1, skipped=0, lr=[7.92e-06], mom=[0.0]
GPU-SERVER-1: [2025-01-06 13:02:08.018] [INFO] [qlora-pipe] step: 1 / 483 loss: 2.6912 iter time (s): 573.599 samples/sec: 0.056 eta: 76h47m
GPU-SERVER-1: before GAS splitting, batch size: 32, total tokens: 262144
GPU-SERVER-1: [2025-01-06 13:11:36,561] [INFO] [logging.py:129:log_dist] [Rank 0] step=2, skipped=0, lr=[1.1840000000000002e-05], mom=[0.0]
GPU-SERVER-1: [2025-01-06 13:11:36.579] [INFO] [qlora-pipe] step: 2 / 483 loss: 2.6866 iter time (s): 568.485 samples/sec: 0.056 eta: 76h22m
GPU-SERVER-1: before GAS splitting, batch size: 32, total tokens: 262144
GPU-SERVER-1: [2025-01-06 13:21:02,534] [INFO] [logging.py:129:log_dist] [Rank 0] step=3, skipped=0, lr=[1.576e-05], mom=[0.0]
GPU-SERVER-1: [2025-01-06 13:21:02.647] [INFO] [qlora-pipe] step: 3 / 483 loss: 2.6782 iter time (s): 566.002 samples/sec: 0.057 eta: 76h4m
GPU-SERVER-1: before GAS splitting, batch size: 32, total tokens: 262144
GPU-SERVER-1: [2025-01-06 13:30:28,674] [INFO] [logging.py:129:log_dist] [Rank 0] step=4, skipped=0, lr=[1.968e-05], mom=[0.0]
GPU-SERVER-1: [2025-01-06 13:30:28.686] [INFO] [qlora-pipe] step: 4 / 483 loss: 2.6933 iter time (s): 565.969 samples/sec: 0.057 eta: 75h52m
and this is only 3x longer than it took to do the 32b model.
One of the GPUs is cutting it a bit close though: 47102MiB / 49140MiB, so hopefully it doesn't randomly OOM during training :/
Also interestingly the 104b model has a much more sane / expected distribution of final hidden states compared to the 32b and 35b models:
I think the jagged bits to the right are due to multi-word tokenisation (ie: the larger the hidden state, the lower the Entropy of the predictions which would make sense for the non-word-start tokens).
I seem cursed when I try to train the bigger models lol:
First I get a "OSError: [Errno 39] Directory not empty" exception in qlora-pipe because of some strange problem to do with temp file locking which has never happened before (1 day wasted).
So I try to write a safer version of shutil.rmtree() using exponential backoff, but forget to add the self parameter to the member function and waste another 1/2 day :facepalm:
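For reference, a retrying shutil.rmtree() along these lines could look something like the sketch below. It is not the code that went into qlora-pipe: the function name, retry count and delays are assumptions, and the `_rmtree` / `_sleep` parameters exist only so it can be tested without touching the disk. It is written as a plain function; if you paste it into a class as a method, remember that `self` has to be the first parameter.

```python
import shutil
import time


def rmtree_with_backoff(path, retries=5, base_delay=0.5,
                        _rmtree=shutil.rmtree, _sleep=time.sleep):
    """Remove a directory tree, retrying on OSError with exponentially growing delays."""
    for attempt in range(retries):
        try:
            _rmtree(path)
            return
        except FileNotFoundError:
            return  # already gone, nothing left to do
        except OSError:
            # e.g. "[Errno 39] Directory not empty" while another process
            # still holds a temp file inside the directory
            if attempt == retries - 1:
                raise  # give up and surface the original error
            _sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```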
It's probably just as well as I have now created a dataset 4x bigger than the original; by shuffling the random order of the ~1.7M paragraphs 4 times.
This will probably make my evaluation loss slightly bogus due to some data leakage, but I think it should help with generalisation in the long run:
- The model will now see each paragraph 4x with completely different preceding random paragraphs and in completely different RoPE-encoded positions.
- It effectively increases my dataset size from 125M tokens up to 500M tokens, and for the larger 104b model this equates to ~5 samples per tunable parameter for a rank-64 "Multiplicative-LoRA" (1 sample per parameter seems a bit close IMO...).
It does mean each training run will now take around 12 days on the larger 104b model (and probably 14-15 days for mistral-large:123b), but this is still just about acceptable IMO...
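The samples-per-parameter figure above can be sanity-checked with a few lines of arithmetic. The token counts and the ~5x ratio come from the post. The hidden size (12288), layer count (64) and the "one rank-64 pair of hidden-sized matrices per layer" shape are assumptions about command-r-plus and the adapter, included only to show that they are consistent with the ~100M implied parameters.

```python
tokens_single = 125e6       # one pass over the paragraphs (from the post)
shuffles = 4                # number of shuffled copies (from the post)
samples_per_param = 5       # approximate ratio quoted in the post

tokens_total = tokens_single * shuffles              # 500M tokens
implied_params = tokens_total / samples_per_param    # ~100M tunable parameters
ratio_before = tokens_single / implied_params        # ~1.25 samples per parameter

# ASSUMPTION: hidden size 12288 and 64 layers for the 104b model, with one
# rank-64 (A, B) pair of shape hidden x rank and rank x hidden per layer.
hidden, layers, rank = 12288, 64, 64
assumed_params = 2 * hidden * rank * layers          # 100,663,296

print(f"implied tunable params: {implied_params:,.0f}")
print(f"assumed-shape params:   {assumed_params:,}")
print(f"samples/param with 4 shuffles: {tokens_total / assumed_params:.2f}")
print(f"samples/param with 1 pass:     {ratio_before:.2f}")
```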
Good news? Bad news? Neutral news. DeepSeek-V3 base model is indeed a true base model(not like Qwen). Probs are a bit on the heavy side though and it is heavily slopped.
This is an opening chapter of a Grimdark trilogy in the style of Joe Abercrombie and Rob J Hayes. It is written in third person personal and features internal monologues of the characters. The POV character for chapter 1 is a cultist who has just escaped his cult. He is dressed in dirty yellow robes and his only possession is a mysterious small (magical!?) mirror he stole from the cult. The story starts with him arriving at an apparently deserted ghost town on the edge of a desert. He has an arrow lodged in his shoulder and is losing his mind due to infection and thirst.
### Chapter 1
The sun was a merciless beast, its fiery breath scorching the earth and turning the once-thriving town into a desolate wasteland. The cultist, named Kael, stumbled into the ghost town, exhausted and weak from days of running. The town was eerily silent, save for the moaning wind that seemed to echo the torment within Kael's own mind. The streets were lined with abandoned houses, their once-vibrant colors now faded and dulled by the relentless sun.
Kael's yellow robes clung to his sweat-drenched body, the fabric stained and torn from his harrowing escape. The arrow lodged in his shoulder throbbed with a searing pain that echoed the infection coursing through his veins. His thirst was all-consuming, his throat parched and raw from the endless thirst that gripped him. As he dragged himself into the town square, his eyes were drawn to the remnants of a forgotten place — dried up fountains and lifeless remnants of a bustling community that had been swallowed by the harsh desert.
In the center of the square stood a jerry-built memorial. Perhaps to commemorate a town nearing extinction. Perhaps to make a statement to the desperate, filthy fools who dared enter this ghost town of plague and starvation.
The emergency structures, molly houses, and scrap piles hidden in shadows turned out to be nothing special. Squatter districts, flickering with a faint sense of life, were nothing more than painted cardboard.
Kael's gaze was drawn to a nearby corrugated iron shed, its doors slightly ajar. He stumbled toward it, drawn by the promise of shelter and a respite from the relentless heat. As he pushed through the creaky doors, the dimly lit interior revealed rows of dusty shelves lined with a faint clutter of supplies. It was a meager salvation, but it was better than the merciless desert outside.
The cultist scavenged through the shed, his trembling hands searching for anything that could quench his thirst. Finally, he found a small, rusted jar of water tucked away in a corner. With desperate urgency, he uncorked the jar and drank greedily, the warm liquid soothing his parched throat. But even this momentary relief could not quell the battle raging within his mind.
Kael's thoughts were a tumultuous storm, clouded by the infection that coursed through his veins. Memories of the cult, the rituals, and the brutal punishments haunted his every waking moment. The eerie whispers of his former brethren echoed in his mind, tempting him to return to the fold, to embrace the darkness that had once consumed him.
But Kael resisted. He had escaped for a reason, a reason he clung to desperately amidst the chaos of his thoughts. The little mirror, the stolen artifact that had become his only possession, held a glimmer of hope. It was more than just a mirror; it was a key to his redemption, a chance to atone for the sins he had committed in the name of the cult.
With trembling hands, Kael pulled out the little mirror from within his tattered robes. Its surface was cracked and weathered, but it held a power that transcended its physical form. As he gazed into its depths, he saw not his own reflection, but the images of the atrocities he had witnessed, the blood spilled, and the lives shattered. It was a mirror to his soul, a reminder of the darkness he had left behind.
But even as Kael contemplated the mirror, a sense of foreboding washed over him. He heard footsteps echoing in the distance, growing louder with each passing moment. Someone was approaching the shed, their presence a threat to Kael's newfound freedom. With a surge of adrenaline, he quickly returned the mirror to its hiding place and scanned the dim room, searching for a weapon of any kind.
He found an old rusted pipe leaning against the wall. It was crude but could serve as a makeshift weapon.
The footsteps grew closer, reverberating through the floor of the wooden shed. Kael gripped the pipe tightly, his knuckles turning white. With every ounce of resolve, he prepared himself for the inevitable confrontation, ready to fight for his freedom, to prove that he was no longer a slave to the cult and its horrors.
The door to the shed burst open, revealing a tall menacing figure in a black cloak. His face was concealed by a large hood that hid any semblance of emotion on it. He was unarmed, but the very air around him seemed to crackle with supernatural energy.
But the man took a step back as Kael instinctively lashed out with the makeshift weapon. The rusted pipe sliced through the thick air, grazing the sorcerer's shoulder. He let out a guttural scream and recoiled in shock as Kael spun around and lunged forward with all his strength.
The pipe collided with the man's temple, knocking him off balance as he stumbled backward, crashing into the wall of the shed. The air around them crackled and sparked as his powers began to manifest freely, causing the contents of the shed to burst into flames.
Kael knew this was his chance to escape, so he scrambled towards the door and burst out into
How are you running it (CPU I assume?) and what tokens/s are you getting?
How are you running it (CPU I assume?) and what tokens/s are you getting?
Dual Epyc home server@3t/s Q8_0. I can't recommend buying it yet though, it's not very worth it.
Looks like a Q2 can be run in 48GB of RAM (and 250GB of disk space) on CPU. Wonder if that means Q4 would run in 128GB?
https://old.reddit.com/r/LocalLLaMA/comments/1hw1nze/deepseek_v3_gguf_2bit_surprisingly_works_bf16/
I don't have time to look into the details at the moment but it's not just a swap file.
DeepSeek-V3 base model is indeed a true base model(not like Qwen).
Just to make sure I'm reading that right, is Mixtral 8x22b v0.3 also a true base model?
Find the llama_tensor_get_type function in src/llama-quant.cpp (currently on line 121).
Find this line in the function:
// for arches that share the same tensor between the token embeddings and the output, we quantize the token embeddings
// with the quantization of the output tensor
if (name == tn(LLM_TENSOR_OUTPUT, "weight") || (!qs.has_output && name == tn(LLM_TENSOR_TOKEN_EMBD, "weight"))) {
and insert this before it:
if (ftype == LLAMA_FTYPE_MOSTLY_Q4_0) {
    // Use Q4_0 for all the non-shared experts' MLP tensors
    if (name.find("ffn_up_exps") != std::string::npos
        || name.find("ffn_gate_exps") != std::string::npos
        || name.find("ffn_down_exps") != std::string::npos) {
        new_type = GGML_TYPE_Q4_0;
    }
    // Use Q8_0 for everything else
    else {
        new_type = GGML_TYPE_Q8_0;
    }
}
else
to look like this:
static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
    const std::string name = ggml_get_name(tensor);
    // TODO: avoid hardcoded tensor names - use the TN_* constants
    const llm_arch arch = qs.model.arch;
    const auto tn = LLM_TN(arch);
    auto use_more_bits = [](int i_layer, int n_layers) -> bool {
        return i_layer < n_layers/8 || i_layer >= 7*n_layers/8 || (i_layer - n_layers/8)%3 == 2;
    };
    const int n_expert = std::max(1, (int)qs.model.hparams.n_expert);
    auto layer_info = [n_expert] (int i_layer, int n_layer, const char * name) {
        if (n_expert > 1) {
            // Believe it or not, "experts" in the FFN of Mixtral-8x7B are not consecutive, but occasionally randomly
            // sprinkled in the model. Hence, simply dividing i_ffn_down by n_expert does not work
            // for getting the current layer as I initially thought, and we need to resort to parsing the
            // tensor name.
            if (sscanf(name, "blk.%d.", &i_layer) != 1) {
                throw std::runtime_error(format("Failed to determine layer for tensor %s", name));
            }
            if (i_layer < 0 || i_layer >= n_layer) {
                throw std::runtime_error(format("Bad layer %d for tensor %s. Must be in [0, %d)", i_layer, name, n_layer));
            }
        }
        return std::make_pair(i_layer, n_layer);
    };
    if (ftype == LLAMA_FTYPE_MOSTLY_Q4_0) {
        // Use Q4_0 for all the non-shared experts' MLP tensors
        if (name.find("ffn_up_exps") != std::string::npos
            || name.find("ffn_gate_exps") != std::string::npos
            || name.find("ffn_down_exps") != std::string::npos) {
            new_type = GGML_TYPE_Q4_0;
        }
        // Use Q8_0 for everything else
        else {
            new_type = GGML_TYPE_Q8_0;
        }
    }
    else
    // for arches that share the same tensor between the token embeddings and the output, we quantize the token embeddings
    // with the quantization of the output tensor
    if (name == tn(LLM_TENSOR_OUTPUT, "weight") || (!qs.has_output && name == tn(LLM_TENSOR_TOKEN_EMBD, "weight"))) {
.
.
.
Then recompile (there might be a syntax error as I have just written this without testing, but hopefully should be easy/clear to fix if there is).
Re-quantize via llama-quantize using Q4_0 and you should see in the printout that it uses Q8_0 for everything apart from those 3 sets of tensors (which should show as Q4_0).
It should run well on CPU so long as you stick to Q8_0 and Q4_0 only, and for generation (which is memory bound) you should get a big speedup (approaching 2x due to most of the model being made up of these 3 tensors).
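To see where "approaching 2x" comes from, here is a back-of-envelope Python sketch. Q4_0 and Q8_0 both store blocks of 32 weights plus one fp16 scale, which works out to 4.5 and 8.5 bits per weight. The expert fractions passed in below are made-up inputs, not measured values for DeepSeek-V3, and the estimate assumes generation speed scales purely with the bytes of weights read per token.

```python
# Block layouts: 32 weights per block plus a 2-byte fp16 scale.
Q4_0_BPW = (16 + 2) * 8 / 32   # 4.5 bits per weight
Q8_0_BPW = (32 + 2) * 8 / 32   # 8.5 bits per weight


def mixed_bpw(expert_fraction):
    """Average bits per weight when expert tensors are Q4_0 and the rest Q8_0."""
    return expert_fraction * Q4_0_BPW + (1.0 - expert_fraction) * Q8_0_BPW


def speedup_vs_q8(expert_fraction):
    """Idealised memory-bound speedup over a pure Q8_0 model."""
    return Q8_0_BPW / mixed_bpw(expert_fraction)


# Hypothetical fractions of the weights that sit in the 3 expert tensors.
for fraction in (0.5, 0.9, 1.0):
    print(f"{fraction:.0%} experts: {mixed_bpw(fraction):.2f} bpw, "
          f"{speedup_vs_q8(fraction):.2f}x")
```

The upper bound is 8.5 / 4.5, a little under 1.9x, reached only if every weight read per token belonged to those expert tensors.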
You can experiment with other tensors by looking in llama-arch.cpp for the LLM_ARCH_DEEPSEEK2 tensor names:
LLM_ARCH_DEEPSEEK2,
{
{ LLM_TENSOR_TOKEN_EMBD, "token_embd" },
{ LLM_TENSOR_OUTPUT_NORM, "output_norm" },
{ LLM_TENSOR_OUTPUT, "output" },
{ LLM_TENSOR_ATTN_NORM, "blk.%d.attn_norm" },
{ LLM_TENSOR_ATTN_Q_A_NORM, "blk.%d.attn_q_a_norm" },
{ LLM_TENSOR_ATTN_KV_A_NORM, "blk.%d.attn_kv_a_norm" },
{ LLM_TENSOR_ATTN_Q, "blk.%d.attn_q" },
{ LLM_TENSOR_ATTN_Q_A, "blk.%d.attn_q_a" },
{ LLM_TENSOR_ATTN_Q_B, "blk.%d.attn_q_b" },
{ LLM_TENSOR_ATTN_KV_A_MQA, "blk.%d.attn_kv_a_mqa" },
{ LLM_TENSOR_ATTN_KV_B, "blk.%d.attn_kv_b" },
{ LLM_TENSOR_ATTN_OUT, "blk.%d.attn_output" },
{ LLM_TENSOR_FFN_NORM, "blk.%d.ffn_norm" },
{ LLM_TENSOR_FFN_GATE, "blk.%d.ffn_gate" },
{ LLM_TENSOR_FFN_UP, "blk.%d.ffn_up" },
{ LLM_TENSOR_FFN_DOWN, "blk.%d.ffn_down" },
{ LLM_TENSOR_FFN_GATE_INP, "blk.%d.ffn_gate_inp" },
{ LLM_TENSOR_FFN_GATE_EXPS, "blk.%d.ffn_gate_exps" },
{ LLM_TENSOR_FFN_DOWN_EXPS, "blk.%d.ffn_down_exps" },
{ LLM_TENSOR_FFN_UP_EXPS, "blk.%d.ffn_up_exps" },
{ LLM_TENSOR_FFN_GATE_INP_SHEXP, "blk.%d.ffn_gate_inp_shexp" },
{ LLM_TENSOR_FFN_GATE_SHEXP, "blk.%d.ffn_gate_shexp" },
{ LLM_TENSOR_FFN_DOWN_SHEXP, "blk.%d.ffn_down_shexp" },
{ LLM_TENSOR_FFN_UP_SHEXP, "blk.%d.ffn_up_shexp" },
{ LLM_TENSOR_FFN_EXP_PROBS_B, "blk.%d.exp_probs_b" },
}
One thing to try would be:
if (ftype == LLAMA_FTYPE_MOSTLY_Q4_0) {
    // Use Q4_0 for all the non-shared experts' MLP up/gate tensors
    if (name.find("ffn_up_exps") != std::string::npos
        || name.find("ffn_gate_exps") != std::string::npos) {
        new_type = GGML_TYPE_Q4_0;
    }
    // Use Q6_K for all the non-shared experts' MLP down tensors
    else if (name.find("ffn_down_exps") != std::string::npos) {
        new_type = GGML_TYPE_Q6_K;
    }
    // Use Q8_0 for everything else
    else {
        new_type = GGML_TYPE_Q8_0;
    }
}
else
but in general the "K" quants are slower on CPU-only systems.
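If you want to preview what a name-matching rule like this will do before recompiling, the same substring logic is easy to mirror in Python. This is only a stand-in for the C++ snippet, not part of llama.cpp: `pick_type` and the example tensor names (built from the "blk.%d." patterns listed above with a ".weight" suffix) are assumptions for illustration.

```python
def pick_type(name, down_type="Q4_0"):
    """Mirror of the substring rule: routed-expert MLP tensors low, all else Q8_0."""
    if "ffn_up_exps" in name or "ffn_gate_exps" in name:
        return "Q4_0"
    if "ffn_down_exps" in name:
        return down_type  # "Q4_0" for the first variant, "Q6_K" for the second
    return "Q8_0"


# Hypothetical tensor names following the "blk.%d.<tensor>" patterns.
examples = [
    "blk.3.ffn_up_exps.weight",
    "blk.3.ffn_gate_exps.weight",
    "blk.3.ffn_down_exps.weight",
    "blk.3.ffn_up_shexp.weight",
    "blk.3.attn_kv_b.weight",
    "token_embd.weight",
]
for name in examples:
    print(f"{name:32s} {pick_type(name, down_type='Q6_K')}")
```

Note that the shared-expert names end in "shexp", so they do not contain the "exps" substrings and fall through to Q8_0, which is the intent of the "non-shared experts" comments.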
You only need this hacked version of llama.cpp to run llama-quantize and can discard it afterwards (the custom GGUF will work fine - I suggest naming it something like -q4_0_XL to know it's different...).
If anyone wants to do this for other (non-MoE / non-CPU-targeted) models, then this is what I use for all of mine now:
// ### JUK ###
if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q6_K) {
    if (name == tn(LLM_TENSOR_OUTPUT, "weight") || name == tn(LLM_TENSOR_TOKEN_EMBD, "weight")) {
        new_type = GGML_TYPE_Q8_0;
    }
    else if (name.find("attn_k.weight") != std::string::npos || name.find("attn_v.weight") != std::string::npos) {
        new_type = GGML_TYPE_Q8_0;
    }
    else if (name.find("attn_q.weight") != std::string::npos || name.find("attn_output.weight") != std::string::npos) {
        new_type = GGML_TYPE_Q8_0;
    }
    else if (name.find("ffn_down") != std::string::npos) {
        new_type = GGML_TYPE_Q6_K;
    }
    else if (name.find("ffn_gate") != std::string::npos || name.find("ffn_up") != std::string::npos) {
        if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) {
            new_type = GGML_TYPE_Q5_K;
        }
        else {
            new_type = GGML_TYPE_Q6_K;
        }
    }
    else {
        // name is a std::string here, so pass c_str() to the printf-style format()
        throw std::runtime_error(format("Unhandled tensor type: %s", name.c_str()));
    }
}
else
// ### JUK ###
as I find that the attention tensor logic of the llama_tensor_get_type function was likely made way before the days of long-context models, and IMO it's (really!) badly hurting a lot of the newer models... I would rather down-bump the up/gate MLP tensors by 1-bit than use really low quants for the attention tensors (and for long-context story writing and coding this seems to be quite obvious if you try it - POV characters get mixed up much quicker with the default attention tensor logic of MOSTLY_Q4_K_M and even MOSTLY_Q5_K_M!).
The logic of using the smallest quant for the up/gate projections (which have large "fan-out") and then a slightly higher quant for the down projection (which has large "fan-in") also works nicely with the imatrix code, as the Q6_K (and Q8_0) code that uses the imatrix.bin file you create is commented-out in the llama.cpp source; so the imatrix stuff will only be applied to the up/gate projections.
There's a lot of other stuff you can try by hacking the llama_tensor_get_type function (eg: if you find that the gap between say MOSTLY_Q4_K_M and MOSTLY_Q5_K_M is too big: with MOSTLY_Q4_K_M wasting several GB of VRAM but MOSTLY_Q5_K_M OOMing, etc).
I've finally figured out WTF is going on with the Cohere models and the double/single newlines:
It turns out that llama.cpp is tokenizing all double-newlines to a single 2126 token:
https://github.com/ggerganov/llama.cpp/pull/6033#issuecomment-2000227286
https://github.com/ggerganov/llama.cpp/issues/6104
whereas for some crazy reason the huggingface BPE tokenizer is tokenizing all double-newlines to a pair of 206 tokens:
https://github.com/huggingface/tokenizers/issues/1534
BUT: Only if the next character is a letter!??
If the next character is a space, number or special token it uses a single 2126 token...
So this means that I am correctly training using the huggingface BPE tokeniser (which one of the devs confirms is using the correct logic that the model was trained on), but obviously as soon as I go into llama.cpp it will tokenize the input prompt wrongly using a single 2126 token (and this explains why the 35b model was so batshit crazy with wanting to use the space-prefixed tokens at the start of each paragraph!!!).
I can't see a good fix for this, so have just elected to completely mask out the label and gradients for these tokens in the same way as I do for the special tokens:
dataset = dataset.map(lambda x: {'attention_mask': torch.ones_like(x['input_ids']), 'labels': x['input_ids']}, desc='adding attention_mask and labels')
########################################################
# Zero out the labels for all the "special" Cohere tokeniser tokens (+ single/double newline tokens)
# USE: https://huggingface.co/spaces/Xenova/the-tokenizer-playground
# SEE: https://huggingface.co/CohereForAI/c4ai-command-r-v01/blob/main/tokenizer.json
dataset = dataset.map(lambda x: {'labels': torch.where(
(x['labels'] <= 7) | # special
(x['labels'] >= 255000) | # special
(x['labels'] == 206) | # \n
(x['labels'] == 2126), # \n\n
torch.full_like(x['labels'], -100),
x['labels']
)}, desc='masking special tokens')
#######################################################
tl.store(logits_ptr + col_offsets, dloss * y, mask = mask)
########################################################
# Zero out the gradients for all the "special" Cohere tokeniser tokens (+ single/double newline tokens)
# NOTE: This should freeze the probability of generating these regardless of our fine-tuning dataset
# USE: https://huggingface.co/spaces/Xenova/the-tokenizer-playground
# SEE: https://huggingface.co/CohereForAI/c4ai-command-r-v01/blob/main/tokenizer.json
zero_mask = (
(col_offsets <= 7) | # special
(col_offsets >= 255000) | # special
(col_offsets == 206) | # \n
(col_offsets == 2126) # \n\n
) & (col_offsets < VOCAB_SIZE)
tl.store(logits_ptr + col_offsets, 0.0, mask = zero_mask)
########################################################
It's not ideal, but at least now it won't slowly reduce the output-probability of the 2126 double-newline token, and the fine-tunes should hopefully be way less confused when used in llama.cpp...
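As a quick way to check the label-masking rule above without loading torch or a dataset, the same condition can be written in plain Python. This is an illustrative stand-in rather than the training code: `mask_labels` is a hypothetical helper, the token ids (<= 7, >= 255000, 206, 2126) are the ones from the snippet above, and -100 is the default ignore_index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index for torch.nn.CrossEntropyLoss


def mask_labels(input_ids):
    """Return labels with special and newline tokens replaced by IGNORE_INDEX."""
    labels = []
    for token_id in input_ids:
        is_special = token_id <= 7 or token_id >= 255000
        is_newline = token_id in (206, 2126)  # "\n" and "\n\n"
        labels.append(IGNORE_INDEX if (is_special or is_newline) else token_id)
    return labels


# Made-up token ids: BOS-like special, text, "\n\n", text, "\n", high special.
print(mask_labels([5, 1000, 2126, 4321, 206, 255001]))
```

Only the labels change here. The input ids are left alone, so the model still sees the newline tokens as context but is never trained to predict them.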
I also didn't realise that the Cohere models output both the attention and MLP outputs to the residual stream in parallel like this:
https://github.com/ggerganov/llama.cpp/pull/6033#issuecomment-1995533641
- This likely explains why the last 2 layers of these models have huge hidden-state norms, whilst other models have only the last layer!
- It also makes me think I should possibly be applying my "Multiplicative-LoRA" to the o_proj tensor as well as the down_proj tensor (trying this now).
- This architecture, coupled with the use of LayerNorm instead of RMSNorm, likely means the Cohere models are really easy to merge with mergekit!
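The difference between the usual sequential block and the parallel layout described above can be shown with a toy Python sketch. The `attn`, `mlp` and `norm` callables here are scalar stand-ins invented for illustration, not real layers. The point is only where each output gets added to the residual stream.

```python
def sequential_block(x, attn, mlp, norm1, norm2):
    """Usual layout: the MLP sees the residual stream after attention was added."""
    h = x + attn(norm1(x))
    return h + mlp(norm2(h))


def parallel_block(x, attn, mlp, norm):
    """Parallel layout: attention and MLP read the same normalised input,
    and both outputs are added to the residual stream together."""
    n = norm(x)
    return x + attn(n) + mlp(n)


# Toy scalar "layers", purely to show the data flow.
identity = lambda v: v
attn = lambda v: 2 * v
mlp = lambda v: 3 * v

print(sequential_block(1.0, attn, mlp, identity, identity))  # 12.0
print(parallel_block(1.0, attn, mlp, identity))              # 6.0
```

In the parallel form the MLP never sees the attention output from its own layer, only from earlier layers, which is one way to read the point about o_proj and down_proj both writing directly to the residual stream.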
If someone has a Reddit account then please link Daniel the post above - it would be really interesting to see what he can come up with!
I've finally figured out WTF is going on with the Cohere models and the double/single newlines:
It turns out that llama.cpp is tokenizing all double-newlines to a single 2126 token:
https://github.com/ggerganov/llama.cpp/pull/6033#issuecomment-2000227286
https://github.com/ggerganov/llama.cpp/issues/6104
whereas for some crazy reason the huggingface BPE tokenizer is tokenizing all double-newlines to a pair of 206 tokens:
https://github.com/huggingface/tokenizers/issues/1534
BUT: Only if the next character is a letter!??
If the next character is a space, number or special token it uses a single 2126 token...
So this means that I am correctly training using the huggingface BPE tokeniser (which one of the devs confirms is using the correct logic that the model was trained on), but obviously as soon as I go into llama.cpp it will tokenize the input prompt wrongly using a single 2126 token (and this explains why the 35b model was so batshit crazy with wanting to use the space-prefixed tokens at the start of each paragraph!!!).
Actually all this seems outdated - I just did a fresh pull and used the test-tokenizer-0 and can confirm it now gives the same output as The Tokenizer Playground for command-r models... Maybe I was using an older version before this fix.
How are you running it (CPU I assume?) and what tokens/s are you getting?
Dual Epyc home server@3t/s Q8_0. I can't recommend buying it yet though, it's not very worth it.
For a few weeks I had access to 8xH200s and it was running at 8t/s 🤣 your 3t/s doesn't sound bad in comparison dollar for dollar
@jukofyork Thanks for the advice, but how's the intelligence affected by it? I'd rather not dumb the model down. For me intelligence>speed.
Just to make sure I'm reading that right, is Mixtral 8x22b v0.3 also a true base model?
Yes. The non-instruct one is a base model, but not a great one.
Here are the top probs for DeepSeek-V3 instruct. 42% for K, that's a new record for overconfidence. As you can see, DeepSeek is clearly arenamaxxing, they included fancy markdown(*) to get extra votes.
I've played around a bit more with DS. It has superior knowledge of trivia (I was surprised that it got a quite obscure reference right) compared to Largestral and is much better at solving riddles and following instructions. The writing style however is pure GPTslop, and it is more difficult to break it out of it than Largestral. What's worse is that while Largestral remained relatively unslopped in the other language that I tested, DS just wrote the same shit it wrote in English, with the same sentence structure and all, which sounded very awkward and unnatural, as if the text was pulled through google translate.
Same story as with base, for comparison.
The sun was a merciless beast, its fiery breath scorching the earth and turning the once-thriving town into a desolate wasteland. The cultist, named Elias, staggered into the ghost town, his yellow robes tattered and filthy, clinging to his sweat-soaked skin like a second layer of torment. His vision blurred, the edges of his sight darkening with every step. The arrow lodged in his shoulder throbbed with a pain that seemed to echo the rhythm of his erratic heartbeat.
The town was eerily silent, save for the occasional gust of wind that stirred the sand and rattled the broken shutters of the abandoned buildings. Elias's parched throat screamed for water, but he knew better than to hope for such luxuries. His mind was a chaotic storm of delirium and fear. The cult had branded him a traitor, and the memory of their twisted rituals and the malevolent whispers of their leader haunted him. He clutched the small, mysterious mirror tightly in his hand, its surface cool against his feverish skin. It was his only possession, stolen in a moment of desperation from the cult's forbidden vault. He didn't fully understand its power, but he knew it was more than just a trinket.
Elias stumbled into the shadow of a crumbling building, leaning heavily against the wall as he tried to catch his breath. His thoughts were a jumbled mess, fragments of memories and hallucinations intertwining. He could hear the voice of the cult leader, a rasping, venomous whisper that slithered into his mind.
"You think you can escape us, Elias? You are ours. Always have been, always will be."
"Shut up," Elias muttered, squeezing his eyes shut in a futile attempt to block out the voice. "I'm not going back. I won't."
The mirror in his hand seemed to pulse faintly, a subtle vibration that made his skin crawl. He glanced down at it, his reflection distorted and ghostly in the cracked surface. For a moment, he thought he saw something move within the glass, a shadowy figure that twisted and writhed. He quickly looked away, his heart pounding in his chest.
"You're losing it," he whispered to himself, his voice trembling. "You're losing your mind."
He forced himself to move, pushing away from the wall and staggering down the deserted street. The town seemed to stretch on endlessly, each building a mirror of the last, decaying and lifeless. His strength was fading fast, and he knew he needed to find shelter before the sun fully claimed him.
As he approached what appeared to be an old inn, the door creaked open on its own accord, the sound grating against the silence. Elias hesitated, his instincts screaming at him to turn and run. But where would he go? He was in no condition to travel any further. With a resigned sigh, he stepped inside, the cool darkness enveloping him like a shroud.
The interior was dimly lit by the sunlight filtering through the cracks in the boarded-up windows. The air was thick with dust and the scent of decay. Elias collapsed onto a wooden bench, his body finally giving in to the exhaustion. He set the mirror down beside him, its presence both comforting and unnerving.
As he leaned back, his mind drifted to the events that had led him here. The escape from the cult, the frantic chase through the desert, the arrow that had found its mark in his shoulder. He had thought he was free, but now, alone in this forsaken town, he couldn't shake the feeling that he was still a prisoner.
The mirror shimmered faintly, catching the dim light in a way that made it seem almost alive. Elias reached out to touch it, his fingers trembling. As soon as his skin made contact, a sharp, icy pain shot through his arm, and the world around him seemed to dissolve into darkness.
When he opened his eyes, he was no longer in the inn. He stood in a vast, desolate landscape, the sky a swirling mass of black clouds and crimson light. Before him stood a figure, cloaked in shadows, its eyes gleaming with an otherworldly intensity.
"Welcome, Elias," the figure said, its voice a deep, resonant echo that seemed to come from everywhere and nowhere at once. "You have brought the mirror. Good. There is much we need to discuss."
Elias felt a surge of fear and confusion. "Who are you? What is this place?"
The figure stepped closer, the shadows around it shifting and writhing like living things. "I am the one who has been waiting for you. And this is just the beginning."
Before Elias could respond, the world around him began to dissolve once more, and he found himself back in the inn, gasping for breath. The mirror lay before him, its surface now still and unremarkable.
He stared at it, his mind racing. What had just happened? Was it a hallucination, a product of his fevered mind? Or was it real? He didn't know, but one thing was certain: the mirror was far more dangerous than he had ever imagined.
Markdown asterisks as top probabilities 🤦♂️
The cultist, named Elo*ra....
I tried RP and creativity with DS FP8. It was rough in my experience. The amount of prompt and sampler voodoo to squeeze anything out was ridiculous. I also kept coming across really egregious grammar issues even with pretty vanilla samplers. I had zero control and visibility of the backend configuration though, so I don't know if I can blame the model.
https://huggingface.co/MiniMaxAI/MiniMax-Text-01
https://filecdn.minimax.chat/_Arxiv_MiniMax_01_Report.pdf
New 46A456B MoE model came out, claims to have 4M(!) context.
They made an outrageous claim by ranking Sonnet the lowest on creative writing, but one look at their evaluators makes it very clear why:
The lyrics are effective due to their vivid imagery, emotional depth, and narrative structure. They create a mysterious and atmospheric setting with phrases like "moonbeams" and "ancient walls," while also conveying the emotional journey of the traveler. The repetition in the chorus reinforces the central theme, making the song memorable. The poetic language and space for interpretation add layers of intrigue and emotional resonance, making the song both engaging and thought-provoking.
The story demonstrates strong world-building and an engaging narrative. The concept of Aetheria is imaginative, with vivid descriptions of floating mountains, crystal rivers, and mystical creatures that evoke a sense of wonder. The protagonist, Elara, is well-developed, with a clear arc from curiosity to heroism, which makes her relatable and inspiring. The pacing is effective, with a balanced mix of adventure, emotional growth, and moments of tension. The supporting characters, like Solara and Pippin, add depth to the story and provide much-needed contrast to Elara’s character, contributing to both the plot and the tone. However, while the overall structure is solid and the themes of courage and self-discovery are timeless, some aspects of the plot feel familiar, following traditional fantasy tropes. The resolution is uplifting but might benefit from more complexity or surprise to elevate it further. Overall, the story shows strong creative potential, with an imaginative world, a compelling heroine, and an uplifting message.
This poem is powerful for its rich imagery and balance between change and continuity. It uses metaphors like "dance of time" and "tapestry spun" to evoke deep emotional resonance. The poem reflects on embracing change while cherishing memories, making it relatable and philosophical. Its rhythmic flow and universal themes of acceptance and personal growth create a harmonious and reflective reading experience.
Their human evaluators decided to cheat the system and offloaded all of their work to GPT4, making that leaderboard essentially a GPT4 preference benchmark. I don't have high expectations for this model.
Time to run RULER on that sucker...
The protagonist,
<*"Ela**ra"*>
makes me shiver, yet feel warm...
I guess I should have looked before I ran my mouth.. they did run RULER 🙈
@jukofyork Thanks for the advice, but how's the intelligence affected by it? I'd rather not dumb the model down. For me intelligence>speed.
Yeah, it will make it a bit dumber and probably best to keep at Q8 if not bothered about the speed.
Sorry for not updating on the creative writer models but been busy and not happy enough with the outcomes to release anything yet (masking the gradients and labels of \n and \n\n tokens just caused other weird tokens like space-newline to start appearing).
I think fundamentally I've got two problems:
1. My data is a steaming pile of shit, full of tables of contents, author's notes and so on.
2. The sample size is too small by about an order of magnitude.
I hope to fix (2) by starting with 15k books instead of 1k books like before.
I'm fixing (1) now by:
- Aggressively filtering the 15k ebooks for obvious crap (bad PDF scans somebody has converted to EPUB, EPUBs with every line bolded and/or italicised, EPUBs which have lost the paragraph information and are all on a couple of huge lines, and so on).
- Using another LLM (currently gpt-4o-mini called from a Bash script using curl, then gpt-4o to redo all the failures) to prune away "front matter" and "end matter".
- Then call the LLM again to classify each as fiction or nonfiction (I only want to train on fiction for now).
- Finally I dedupe by stripping all but alphabetical characters, and then hashing every line in every book to find overlap (ie: remove not just outright dupes, but books/stories within anthologies, etc).
I'm about 1/2 through now and down to 11k books remaining (I decided to do the "front matter" and "end matter" removal twice: once on the HTML that comes from EPUB --> HTML conversion, then again on the Markdown that comes from HTML --> Markdown conversion) - to be extra sure not to repeat the "disclaimer" problem again!
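A rough sketch of the dedupe step described in the list above, written in Python purely for illustration (the actual pipeline is Bash). The names `line_hashes` and `find_overlaps`, the 20-character minimum and the 0.8 threshold are all assumptions, not values from the post:

```python
import hashlib
import re
from collections import defaultdict

NON_ALPHA = re.compile(r"[^A-Za-z]")

def line_hashes(text, min_chars=20):
    """Hash every line after stripping all but alphabetical characters."""
    hashes = set()
    for line in text.splitlines():
        key = NON_ALPHA.sub("", line)
        # Assumption: skip very short lines ("Chapter One" etc.) that match everywhere.
        if len(key) >= min_chars:
            hashes.add(hashlib.sha1(key.encode("ascii")).hexdigest())
    return hashes

def find_overlaps(books, threshold=0.8):
    """Return (a, b, fraction) where `fraction` of a's hashed lines also occur in b."""
    hashed = {name: line_hashes(text) for name, text in books.items()}
    owners = defaultdict(set)  # line hash -> names of the books containing it
    for name, hashes in hashed.items():
        for h in hashes:
            owners[h].add(name)
    shared = defaultdict(int)  # (a, b) -> number of line hashes in both
    for names in owners.values():
        for a in names:
            for b in names:
                if a != b:
                    shared[(a, b)] += 1
    overlaps = []
    for (a, b), count in shared.items():
        fraction = count / len(hashed[a])
        if fraction >= threshold:
            overlaps.append((a, b, fraction))
    return sorted(overlaps)
```

Because the fraction is measured relative to the first book, a short story that is fully contained in an anthology gets flagged even though the anthology itself is mostly unique.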
The mistake I made with the sample size is trying to compare the control vector sample size that I used:
- Each set of control vectors uses 3 x hidden_dim samples of sequences but we ultimately only look at the single next token generated.
- I've been counting a sequence of 8192 tokens as 8192 samples and really it should be counted as a single sample (yes, it's larger than the control vectors sequence, but for the control vectors I used the trick of subtraction from a baseline and also had much better data to use).
So for command-r we have hidden_dim of 8192, so:
8192 samples of sequences of length 8192 = 64M tokens
Each LoRA rank uses 2 vectors instead of 1 for the control vectors:
64M * 2 = 128M
Each control vector was made from two samples (subtracted from a baseline but we'll ignore that for now):
128M * 2 = 256M
So my 200M tokens I got from 1k books isn't really enough to train even a rank-1 Multiplicative-LoRA with a similar number of sequences as was used by the control vectors!!!
This is likely why I'm finding it so hard to get anything useful and the comparatively huge rank-16 or even rank-64 LoRAs are just learning weird shit to do with the formatting of my (terrible) dataset!
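The token-count arithmetic above can be double-checked with a few lines of Python. The variable names are made up for illustration, and "M" is taken to mean 2^20, which is what makes 8192 x 8192 come out as exactly 64M:

```python
hidden_dim = 8192  # command-r hidden size, as stated above
seq_len = 8192     # tokens per training sequence

tokens_per_vector = hidden_dim * seq_len   # 8192 sequences of 8192 tokens each
tokens_rank1_lora = tokens_per_vector * 2  # each LoRA rank uses 2 vectors
tokens_matched = tokens_rank1_lora * 2     # each control vector came from two samples

M = 1024 * 1024  # assumption: "M" above means 2^20
print(tokens_per_vector // M, tokens_rank1_lora // M, tokens_matched // M)

available = 200_000_000  # roughly what 1k books produced
print(f"fraction of the matched budget available: {available / tokens_matched:.2f}")
```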
I hope to have around 2B samples after the above process, and may try dropping the sequence length to 4096 or even 2048.
This is just about the limit of what I can train using 6 x A6000 in a sensible amount of time.
I obviously won't be able to release the dataset, but I will tidy up and then open the GitHub repo with the Bash scripts and prompt templates I've worked out.
I'm pretty much done with trying to use Python for any large scale data handling now (and basically anything else unless I absolutely have to use it for stuff like PyTorch).
I've been thinking about ways to solve the problem of the fine tuning finding more consistent signal in the weird formatting than the actual author's use of prose:
The idea would be that each batch would have 50% similarly formatted "GPT-slop" stories and hopefully by subtracting the gradients you would only end up with the "prose signal". This should also solve all the problems with the special symbol output probabilities dropping off, weird double-newline stuff, etc.
This is called "Unlikelihood Training":
https://arxiv.org/abs/1908.04319
which is basically the same as DPO where you set the reward weight to 1 for all samples:
https://arxiv.org/abs/2305.18290
(worryingly appendix C.3 claims this doesn't work well though...)
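To make the idea concrete, here is a toy sketch of the two loss terms, assuming you already have the model's probability for each target token. It applies the unlikelihood term, -log(1 - p), to every token of the slop sample, which is one reading of the "subtract the gradients" idea rather than anything tested here; the function names and the `alpha` weight are hypothetical:

```python
import math

def likelihood_loss(token_probs):
    """Standard NLL: mean of -log p(token) over a 'good prose' sequence."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def unlikelihood_loss(token_probs, eps=1e-8):
    """Mean of -log(1 - p(token)) over a 'slop' sequence.

    Low when the model already finds the slop tokens unlikely, high otherwise.
    """
    return -sum(math.log(max(1.0 - p, eps)) for p in token_probs) / len(token_probs)

def batch_loss(prose_probs, slop_probs, alpha=1.0):
    """50/50 batch: pull towards the prose sample, push away from the slop sample."""
    return likelihood_loss(prose_probs) + alpha * unlikelihood_loss(slop_probs)
```

In a real training loop the probabilities would come from the model's softmax outputs and the gradient would flow through both terms; the pure-Python version only shows which direction each term pushes.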
There's no way I can be arsed manually generating a reward signal for 1000s of stories (I think I'd develop an insane desire to murder anyone called Elara, Kael or Malachi about 20 stories in!).
So if this doesn't work then I think I'm gonna give up on trying to train this via fine-tuning and go back to seriously investigating using simpler linear algebra / contrastive learning methods which worked well for the control vectors...
Does anyone remember a Reddit post where a guy found a way of generating distributionally correct scrambled text and then asking an LLM to regenerate it?
The idea was to try to reduce slop, but IIRC it didn't actually work very well and the dreaded "shivers down spine" and so on showed up anyway... BUT:
If I can find it then I might be able to use this to deliberately corrupt my dataset into slop! and this would likely make contrastive learning work way better and be much more sample efficient.
I can't find it or remember the title of the post though :(
https://old.reddit.com/r/LocalLLaMA/comments/1f3enpf/help_me_name_a_new_technique/
Found the thread, but there was a GitHub repo somewhere IIRC?
Hey! I even commented on that post 🤣
I can't find any evidence that the code was posted on Reddit. I just slid into his DMs so I should either get the code, or a restraining order.
Hey! I even commented on that post 🤣
I can't find any evidence that the code was posted on Reddit. I just slid into his DMs so I should either get the code, or a restraining order.
Yeah, I think you're right and it was this that I had bookmarked:
but for some reason I thought it was the code for that Reddit post :/
Somebody needs to train slopBERT: takes good stories and turns them into horrible slop via masked token prediction! :D
I just slid into his DMs so I should either get the code, or a restraining order.
lol, this happened to be the first thing I saw in the massive email thread when I clicked it this morning :D
offloaded all of their work to GPT4
Well spotted, thanks for saving me some time.
I'm pretty much done with trying to use Python for any large scale data handling now
Are you using dataset libraries or just using python functions to process text? The latter can be slow for sure, but I've processed eg. 15gb datasets pretty well with using python libs on a 64gb ram desktop before. You could also split the dataset up and process it in batches if you're doing a lot of text processing.
But if bash works; well, bash is like glue, using the c/c++ unix tools which are really well optimized.
Somebody needs to train slopBERT: takes good stories and turns them into horrible slop via masked token prediction! :D
Just send each segment of your dataset to Qwen or that minimax model linked above, with a prompt like "rewrite this passage for me" ;)
P.S. Would a dataset in the format your soon-to-be-published scripts produce, be useful for pre-training?
I'm pretty much done with trying to use Python for any large scale data handling now
Are you using dataset libraries or just using python functions to process text? The latter can be slow for sure, but I've processed eg. 15gb datasets pretty well with using python libs on a 64gb ram desktop before. You could also split the dataset up and process it in batches if you're doing a lot of text processing.
I was using the Huggingface Dataset.map() in qlora-pipe with ~500MB of text data:
- A single 500MB file gives an OOM error on a machine with 0.5TB of RAM (using some O(n^2) algorithm that doesn't scale...?).
- Same 500MB file split into 1.5M paragraphs never completes after several hours (again, likely using some O(n^2) algorithm that doesn't scale or some massive hashing overhead).
- Same 500MB file split into 10MB sections uses 60-80GB of RAM to process, all the while spawning lots of threads on a single core of an 88-core machine.
Usually the problem with these "one size fits all" datatypes is that they have optimised too much for one side of the compute/memory Pareto front, but this just seems completely retarded...
It's not just this though - everything in Python just seems the worst of all worlds, and encourages terrible coding practices. This is especially bad as it's aimed at people with more of a "maths background" than a computer science background, and giving these sort of people great flexibility to do the same thing a multitude of ways is just asking for huge trouble (and why Fortran, which is the exact opposite to this, was aimed at these sort of programmers for 50+ years!).
But if bash works; well, bash is like glue, using the c/c++ unix tools which are really well optimized.
Yeah, it's surprising just how much you can do with Bash scripts:
- Nearly all data processing can be done with just sort/uniq/head/tail/cut.
- The jq tool is really good for dealing with JSON.
- curl for calling APIs.
Somebody needs to train slopBERT: takes good stories and turns them into horrible slop via masked token prediction! :D
Just send each segment of your dataset to Qwen or that minimax model linked above, with a prompt like "rewrite this passage for me" ;)
The problem is that causal LLMs can't really use the context to swap out the words :( slopBERT would be able to look at the text that comes after each word and decide exactly how best to ruin your story with "shivers down spines", "Elara", etc and have it schematically correct afterwards too.
How do I unsubscribe to this discussion thread?
P.S. Would a dataset in the format your soon-to-be-published scripts produce, be useful for pre-training?
Yeah, in theory it should be able to take the whole of the books3 dataset and sort it all out too (it's a complete mess - also full of scrambled TOCs, author's notes and so on).
I've written the API stuff now (plus untested code to deal with the OpenAI batch API), and some of the templating stuff too.
Eventually I hope to have the Bash equivalent of OmniChain and can then easily setup workflows where failures are handed off to smarter LLMs, iterated over, etc.
It should be really useful for tasks like this, reasonably fast and low overhead (plus with careful thought you can often use GNU parallel to speed things up).
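The "hand failures off to a smarter LLM" workflow boils down to something like the sketch below. It is in Python only to keep it short and testable (the real thing is Bash), and `run_with_escalation` plus the convention that a model call returns None on failure are assumptions for illustration, not anything from the repo:

```python
def run_with_escalation(items, models):
    """Try each model in order (cheapest first); only the failures move on.

    `models` is a list of callables taking the item text and returning a
    result string, or None on failure (hypothetical convention).
    Returns (results, still_failed).
    """
    results = {}
    pending = list(items)
    for call in models:
        failed = []
        for item in pending:
            output = call(item)
            if output is None:
                failed.append(item)  # hand this one to the next, smarter model
            else:
                results[item] = output
        pending = failed
        if not pending:
            break
    return results, pending
```

With two models this matches the gpt-4o-mini then gpt-4o setup mentioned earlier: the expensive model only ever sees whatever the cheap one could not handle, and anything still in the second return value needs a manual look.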
I opened it up:
https://github.com/jukofyork/bash-llm
but it is still very much a work-in-progress and some of the scrappy stuff like find_line_number.sh that I copied from my original monolithic script will likely get changed/removed and lots of other stuff will likely get moved around and renamed too...
The 3 main scripts will be these:
https://github.com/jukofyork/bash-llm/blob/main/api_call.sh
https://github.com/jukofyork/bash-llm/blob/main/template_substitute.sh
https://github.com/jukofyork/bash-llm/blob/main/json_extract.sh
(not tested it but api_call.sh should work with OpenAI compatible APIs, but all the other api_XXX.sh stuff is to do with their 50%-cheaper batch-API).
I also added the templates I've been using with my original monolithic script:
https://github.com/jukofyork/bash-llm/tree/main/templates
You can get even gpt-4o-mini to work really well if you follow that style of asking for analysis before the decision.
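For anyone who hasn't looked at the templates, the "analysis before the decision" style looks roughly like the sketch below. The prompt wording, the DECISION line format and the helper names are all made up for illustration and are not copied from the repo's templates:

```python
import re

# Hypothetical prompt: NOT taken from the templates directory linked above.
PROMPT = (
    "Here is the opening of a book:\n\n{excerpt}\n\n"
    "First write a short analysis of the evidence for and against it being fiction. "
    "Then finish with a final line of exactly 'DECISION: fiction' or 'DECISION: nonfiction'."
)

DECISION = re.compile(r"^DECISION:\s*(fiction|nonfiction)\s*$", re.MULTILINE)

def build_prompt(excerpt):
    """Fill the excerpt into the prompt template."""
    return PROMPT.format(excerpt=excerpt)

def parse_decision(reply):
    """Return the last decision in the reply, or None so the caller can retry."""
    matches = DECISION.findall(reply)
    return matches[-1] if matches else None
```

Asking for the analysis first means the model has already written out its reasoning by the time it commits to the final line, and a reply with no parseable decision can be treated as a failure and retried.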
I eventually would like to set up LLMs looking at every line/paragraph of a book, and sort out any bad formatting or hyphenation, etc.
Yes I'm a big fan of bash scripts myself :)
Thanks for opening it up, bookmarked to check out in a week when I start tackling the pretraining project.
How do I unsubscribe to this discussion thread?
https://huggingface.co/notifications
Click checkbox for the thread and then the "Done" button at the top.
Yes I'm a big fan of bash scripts myself :)
Thanks for opening it up, bookmarked to check out in a week when I start tackling the pretraining project.
I'll hopefully have it tidied up a bit more by then.
If you use C++ then I can tell you exactly what not to use for this:
- Boost.JSON.
- libcurl multi-socket API.
The Boost.JSON code is just a wrapper around Boost.PropertyTree and scales horribly (likely due to the ptree stuff doing crazy amounts of dynamic allocation for everything).
The libcurl library is really nice to deal with (and easy to make a C++ wrapper for), but the multi-socket API is a nightmare and seems to have near-impossible to find thread synchronization bugs that only show up with scale...
We've got some code using both these and it's led to nothing but pain for the last few years and really needs completely rewriting :/
Not really related to LLM API stuff, but Boost.Serialization has led to no end of problems too, so at least for binary serialisation: keep well clear!