Jim Lai

grimjim

AI & ML interests

Experimenting primarily with 7B-12B parameter text completion models. Not all models are intended for direct use, but aim for educational and/or merge purposes.

Recent Activity

updated a collection about 5 hours ago
Quantized models
updated a model about 5 hours ago
grimjim/meta-llama-Llama-3.2-1B-Instruct-exl2
posted an update 1 day ago
Speculative decoding only requires that the tokenizers for the two LLMs used line up; the model architectures do not have to be otherwise compatible. As proof of concept, I used exllamav2 to run Llama 3.2 1B Instruct (at 6bpw, for speed) as the draft model to accelerate the target model of a Llama 3 8B merge of Instruct models (at 8bpw, for accuracy). The difference between tokenizers was minor enough to allow this. With 8k context length allocated for each model, both fit in under 13GB VRAM. https://github.com/turboderp/exllamav2 https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct https://huggingface.co/grimjim/llama-3-Nephilim-v3-8B The proof-of-concept Python script compared a zero-shot creative task of writing a story limited to 500 tokens. Speculative decoding improved performance by approximately one third (e.g., increasing from 31 tokens/sec to 46 tokens/sec) over conventional decoding, and was consistent over a few runs. While not statistically significant, this implies that smaller models aimed at edge computing can serve effectively as draft models in the general case. It is straightforward to consult literature to affirm that fine-tuning draft models can be a way of inducing behavioral change in target models, in a manner not unlike how samplers can be used to induce changes. I speculate that the impact of a fine-tuned draft model would be on part with a LoRA (Low-Rank Adaptation), as the target model retains veto power. The small size of draft model candidates means that more people can perform local full fine-tuning. It is intuitively obvious that a distilled model can be used as a draft model for the larger teacher model so long as tokenizers line up; e.g., a distilled 8B model can draft for a 70B teacher model. Perhaps Llama-3.1-SuperNova-Lite 8B could effectively draft for the original Llama-3.1-405B-Instruct model. https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct
View all activity

Organizations

Social Post Explorers's profile picture Debased AI's profile picture Anthracite's profile picture Anthracite Core's profile picture

grimjim's activity

posted an update 1 day ago
view post
Post
505
Speculative decoding only requires that the tokenizers for the two LLMs used line up; the model architectures do not have to be otherwise compatible. As proof of concept, I used exllamav2 to run Llama 3.2 1B Instruct (at 6bpw, for speed) as the draft model to accelerate the target model of a Llama 3 8B merge of Instruct models (at 8bpw, for accuracy). The difference between tokenizers was minor enough to allow this. With 8k context length allocated for each model, both fit in under 13GB VRAM.
https://github.com/turboderp/exllamav2
meta-llama/Llama-3.2-1B-Instruct
grimjim/llama-3-Nephilim-v3-8B

The proof-of-concept Python script compared a zero-shot creative task of writing a story limited to 500 tokens. Speculative decoding improved performance by approximately one third (e.g., increasing from 31 tokens/sec to 46 tokens/sec) over conventional decoding, and was consistent over a few runs. While not statistically significant, this implies that smaller models aimed at edge computing can serve effectively as draft models in the general case.

It is straightforward to consult literature to affirm that fine-tuning draft models can be a way of inducing behavioral change in target models, in a manner not unlike how samplers can be used to induce changes. I speculate that the impact of a fine-tuned draft model would be on part with a LoRA (Low-Rank Adaptation), as the target model retains veto power. The small size of draft model candidates means that more people can perform local full fine-tuning.

It is intuitively obvious that a distilled model can be used as a draft model for the larger teacher model so long as tokenizers line up; e.g., a distilled 8B model can draft for a 70B teacher model. Perhaps Llama-3.1-SuperNova-Lite 8B could effectively draft for the original Llama-3.1-405B-Instruct model.
arcee-ai/Llama-3.1-SuperNova-Lite
meta-llama/Llama-3.1-405B-Instruct
posted an update about 2 months ago
view post
Post
1998
To demonstrate that it was possible, I performed a "trapezoid" gradient merge of a Llama 3 8B model onto Llama 3.1 8B Instruct, favoring the L3.1 model at the ends in order to preserve coherence and limiting the influence of the L3 model to at most 0.1 weight. Tested to 16k context length.
grimjim/Llama-Nephilim-Metamorphosis-v2-8B
replied to their post 2 months ago
view reply

Those look like prefills. Unless you want to train for prefill-specific outputs, it makes sense to remove them.

replied to their post 2 months ago
view reply

For DPO, I'd stick with what HF recommends, which in their example does not have prompt repetition.
https://huggingface.co/docs/trl/main/en/dpo_trainer

Offhand, for multi-turn data, I'd go with what the LLM "sees" in practice, so prior turns are probably part of the prompt, and "chosen" and "rejected" guide what text generation occurs.

posted an update 3 months ago
view post
Post
1955
I was reading through an abstract and found myself wondering how much LLM performance is being left on the table due to insufficient curation of training datasets: "Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning" by Kaur, Park, Goyal, Arora.
https://arxiv.org/abs/2408.14774
In particular, the observation that "Introducing low quality answers ("shirkers") in 20% of Instruct-SkillMix examples causes performance to plummet..." had me wondering how many ostensibly good datasets out there are in fact populated with a significant number of "shirkers".
ยท
posted an update 3 months ago
view post
Post
3229
I found this paper to be thought-provoking: "Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling" by Bansal, Hosseini, Agarwal, Tran, and Kazemi.
https://arxiv.org/abs/2408.16737
The direct implication is that smaller models could be used to create cost-effective synthetic datasets. And on that note, in the Gemma terms of use, Google explicitly claims no rights on outputs generated from those models, which means one is free to synthgen from the Gemma line. Meta's Llama 3 licence forbids synthetic generation of outputs if used to improve other models. Relevant Mistral, Qwen, and Yi models under the Apache 2.0 license are unrestricted for this purpose.
  • 2 replies
ยท
posted an update 3 months ago
view post
Post
1636
This merge, this time grounded in Gemma2 9B Instruct fine-tunes, is another demonstration that models without any fine-tuning to support roleplay can still perform the function, maintaining coherence and attention to context. It should be evident that no overt fine-tuning is required for roleplay in text generation; pretraining should provide models with a requisite basic understanding of the world, so all that should be needed is some corrective fine-tuning to address observed defects in portraying the world along with datasets to promote a suitably entertaining writing style. Good Instruct tuning should promote reasoning, coherence, and attention to context.
grimjim/Kitsunebi-v1-Gemma2-8k-9B
grimjim/Kitsunebi-v1-Gemma2-8k-9B-GGUF

I opted not to incorporate the UCLA SPPO fine-tune for Gemma2 9B after observing context confusion occur with some frequency during complex scenarios.

Thanks to Axcxept co., ltd. for fine-tuning HODACHI/EZO-Common-9B-gemma-2-it, and to Princeton NLP Group for fine-tuning princeton-nlp/gemma-2-9b-it-SimPO.
AXCXEPT/EZO-Common-9B-gemma-2-it
princeton-nlp/gemma-2-9b-it-SimPO
posted an update 4 months ago
view post
Post
2784
I've observed that the layers targeted in various abliteration notebooks (e.g., https://colab.research.google.com/drive/1VYm3hOcvCpbGiqKZb141gJwjdmmCcVpR?usp=sharing ) appear to be arbitrary, reflecting probable brute-force exploration. This doesn't need to be the case.

Taking a cue from the paper "The Unreasonable Ineffectiveness of the Deeper Layers" ( https://arxiv.org/abs/2403.17887 ) and PruneMe (https://github.com/arcee-ai/PruneMe), it seems reasonable to target deeper layers identified as more redundant given measured similarity across layers, as the result should be less damaging to models, reducing the need for subsequent fine-tuning. Intuitively, one should expect the resulting intervention layers to be deep but not final. The only uncertainty is if the redundancy successfully encodes refusals, something which is almost certainly model-dependent. This approach only requires the redundancy to be computed once per model, and the result used as a starting point for which layer range to restrict intervention to.
posted an update 4 months ago
view post
Post
4176
I've come across theoretical justification for my prior experimentation with extremely low-weight mergers: they amount to flattening a model so its "massive activation" features remain as significant contributors. Extremely low-weight merge weights also effectively sparsify a contributing model with regard to the base model, but in a way which still preserves relationships within the flattened latent space. In the paper "Massive Activations in Large Language Models", the authors observed "very few activations exhibit significantly larger values than others (e.g., 100,000 times larger)", which in turn implies a lower bound in effective application of extremely low weight merging.
https://arxiv.org/abs/2402.17762
  • 1 reply
ยท
replied to their post 4 months ago
view reply

Not the same model, but a related model.

I created what amounted to an abliteration LoRA by contrasting original L3 Instruct against failspy's abliterated L3 Instruct, then applied and merged the L3-derived LoRA on top of original L3.1 Instruct to obtain the final model.

The result appears to outperform mlabonne's reapplication of the abliteration technique directly to L3.1 Instruct.

replied to their post 4 months ago
posted an update 4 months ago
posted an update 4 months ago
view post
Post
2752
Intelligence is all you need for roleplay.
Roleplay is overlooked as a special case of chain-of-thought, where context must be attended to and inferred state of the world and embodied minds must be persisted and evolved along credible narrative lines. LLMs are also being tasked to function as gamemasters. It's a challenging task which points to potential future benchmarks. The fact that the largest commercial LLMs are adept in generating text for roleplay intuitively implies that model intelligence is sufficient so long as it can generalize properly and pay attention to context without becoming confused.
This recent merge of mine composed using 3 academic fine-tunes, none of which were intended for roleplay, has survived the gauntlet of a Reddit post and appears to be a particularly strong 8B model when it comes to roleplay coherence.
grimjim/llama-3-Nephilim-v3-8B (bf16 weights)
grimjim/llama-3-Nephilim-v3-8B-GGUF (select quants)
  • 1 reply
ยท
replied to their post 5 months ago
view reply

Something odd is happening when merging with the OVA model. It will reduce refusals at medium (0.5-0.6) weight, but at full (1.0) weight against Instruct 8B, the result is incoherent. The LoRA should work, though!

It should also be possible to modify abliteration scripts to instead directly produce a LoRA as output.

posted an update 5 months ago
view post
Post
2265
Below we experiment with negative merger weighting (-1.0!) using task arithmetic. Merge formula on the model card and in the repo itself.

This model is steered to behave opposite to what MopeyMule demonstrated.

Based on the implications of the merge technique, we also propose Orthogonalized Vector Adaptation (OVA). We also extract a LoRA of the counter-refusal abliteration steering vector.

The resulting merger is not a perfect model, but it's a behaviorally interesting model. The model name was inspired by a Philip K. Dick story.
grimjim/Llama-3-Perky-Pat-Instruct-8B

Refusal vector weights ready for use:
grimjim/Llama-3-Instruct-abliteration-OVA-8B
grimjim/Llama-3-Instruct-abliteration-LoRA-8B
ยท
posted an update 5 months ago
posted an update 6 months ago
posted an update 6 months ago
view post
Post
1685
I propose "merge densification", a style of merger which attempts to transfer the benefits of a denser model to a base model. The model weight in this case is 0.02, which is atypically small for mergers, but high compared to the learning rate used during training. In this case, the expectation is more creative text-generation. More details below:
grimjim/kunoichi-lemon-royale-v3-32K-7B
posted an update 6 months ago
view post
Post
1379
I use mergekit regularly, and often enough get acceptable results without performing fine-tuning afterward. My current thinking is that DARE-TIES should be avoided when merging dense models, as the process of thinning inherently punches holes in models.

I've had success using SLERP merges to graft Mistral v0.1 models with Mistral v0.2 models to obtain the context length benefits of the latter, and am looking forward to experimenting with Mistral v0.3, which recently dropped.
  • 1 reply
ยท