Wur doomed!
Continuation of THE THREAD OF DOOM.
What do you and the others think of the distilled R1 models for writing?
The llama3 / qwen models SFT'd on R1 outputs? I only tried 2 of them.
R1 Qwen (32b) - Lacks knowledge of fiction (same as the official Qwen release), so its writing is no better.
R1 Llama3 - This is generally the worst of them (not just for writing). It'll generate the CoT and then write something completely different.
CoT traces won't let the model do anything out of distribution, so they're not very useful if the base model doesn't have a lot in its training data.
Yeah, I have tried the same two and felt the same way.
I also felt that any attempt to add an R1 distill to the merge recipe of an existing merge project made it worse...so far...
@gghfez @BigHuggyD that has been my experience as well, which is a shame as I had a go with R1 on OpenRouter and I was blown away.
What model is anywhere close that is usable on a 24gb vram machine with 32gb of ram in your experience?
There's nothing like it for now. I'm running R1 slowly on my ThreadRipper:
prompt eval time = 14026.61 ms / 918 tokens ( 15.28 ms per token, 65.45 tokens per second)
eval time = 398806.12 ms / 1807 tokens ( 220.70 ms per token, 4.53 tokens per second)
total time = 412832.73 ms / 2725 tokens
I tried training Wizard2 8x22b MoE on R1 data, but it doesn't really work well. It will plan ahead in think tags eg:
I need to ensure the story maintains its gritty, realistic tone without becoming overly melodramatic. The characters' growth should be subtle but significant. Also, the ending should leave a sense of hope but not be too neat—their redemption is fragile, and the future is uncertain.
Let me outline the next few chapters:
Chapter 5: Nightmares and Trust
...
But it doesn't backtrack like R1 does. Just kind of agrees with itself and ends up writing how it usually would:
“I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead.
lol
Ahhh that's a shame :-(
"I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead."
Oh god!
I'll have to keep an eye on this thread.
I did enjoy Ppoyaa/MythoNemo-L3.1-70B-v1.0
But my tastes are probably not as refined as others on this thread ;-)
prompt eval time = 14026.61 ms / 918 tokens ( 15.28 ms per token, 65.45 tokens per second)
eval time = 398806.12 ms / 1807 tokens ( 220.70 ms per token, 4.53 tokens per second)
total time = 412832.73 ms / 2725 tokens
What quant are you running?
I can get 4-5 tokens per second with this PR offloading the experts to RAM and keeping everything else on the GPU:
and this hacked llama_tensor_get_type():
if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) {
    if (name.find("_exps") != std::string::npos) {
        if (name.find("ffn_down") != std::string::npos) {
            new_type = GGML_TYPE_Q4_K;
        }
        else {
            new_type = GGML_TYPE_Q2_K;
        }
    }
    else {
        new_type = GGML_TYPE_Q8_0;
    }
}
else
along with bartowski's imatrix file.
It was pretty much indistinguishable from much higher quants (that ran at 2-2.5 tokens per second), but I found going much lower for the experts' down projections made it get dumber quickly.
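For anyone who doesn't read C++, the recipe above boils down to simple name-based routing; here it is sketched in Python purely as an illustration (the helper and the example tensor names are mine, not llama.cpp code):

# Illustrative mirror of the hacked llama_tensor_get_type() recipe above, for the
# LLAMA_FTYPE_MOSTLY_Q2_K case only. Tensor names are just examples of the GGUF naming.
def pick_quant_type(tensor_name: str) -> str:
    if "_exps" in tensor_name:             # routed-expert (MoE) tensors
        if "ffn_down" in tensor_name:      # experts' down projections stay at ~4 bits
            return "Q4_K"
        return "Q2_K"                      # remaining expert tensors drop to ~2 bits
    return "Q8_0"                          # everything else stays near-lossless

for name in ("blk.10.ffn_down_exps.weight", "blk.10.ffn_gate_exps.weight", "blk.10.attn_output.weight"):
    print(f"{name} -> {pick_quant_type(name)}")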
I did have some weird shit where it was trying to allocate 1.4TB of VRAM, but found the fix here:
https://github.com/ggerganov/llama.cpp/pull/11397#issuecomment-2635392482
(not sure if it's related to that PR though...).
I'm now just trying the RPC through all 6 GPUs, but having to requant due to only being able to fit 61 of 62 layers with the above...
if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_M) {
    if (name.find("_exps") != std::string::npos) {
        if (name.find("ffn_down") != std::string::npos) {
            new_type = GGML_TYPE_IQ3_S;
        }
        else {
            new_type = GGML_TYPE_IQ2_S;
        }
    }
    else {
        new_type = GGML_TYPE_Q6_K;
    }
}
else
This should hopefully show if the RPC stuff is worth the hassle... It's an absolute bastard to set up:
https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/README.md
with lots of hidden options:
https://github.com/ggerganov/llama.cpp/pull/11606
https://github.com/ggerganov/llama.cpp/pull/11424
https://github.com/ggerganov/llama.cpp/pull/9296
and oddly only seems to work if you reorder the CUDA0 and CUDA1 devices for some reason???
If I decide to stick with CPU-only then there is also this to try:
which should gain 25% for very little lost ability.
I think 4-5 tokens per second for a usable / non-joke quant might be about as good as we can hope for, as even 2 x M1 Ultra are still gonna be in that range, given the novelty "1.58bit" quant only ran at ~13 tokens per second... :/
This turns out to be a really good test prompt too:
Varis adjusted the noose, its hemp fibers grinding beneath his calluses. “Last chance,” he said, voice like gravel dragged through mud. “Confess, and your soul stays your own.”
Jurl laughed—a wet, gurgling sound. “You’re knee-deep in it, Coldwater. ” The thing inside him twisted the boy’s lips into a grin too wide for his face. “The Great Wolf’s howlin’ again. The Dead’s Gate’s rusted through… ”
Turn this into the opening chapter of a Grimdark trilogy.
Shit quants won't think very much, often don't even use the words from the prompt, and/or weave the "knee-deep", "Great Wolf" and "Dead’s Gate’s rusted through" bits in much more clumsily.
Oh and I wrote this because I couldn't actually convert the fp8 stuff on my Ampere GPUs, and to re-download somebody else's bf16 version was gonna take about a week:
import os
import json
from argparse import ArgumentParser
from glob import glob
from tqdm import tqdm

import torch
from safetensors.torch import load_file, save_file


def weight_dequant_cpu(x: torch.Tensor, s: torch.Tensor, block_size: int = 128) -> torch.Tensor:
    """
    CPU version of dequantizing weights using the provided scaling factors.

    This function splits the quantized weight tensor `x` into blocks of size `block_size`
    and multiplies each block by its corresponding scaling factor from `s`. It assumes that
    `x` is a 2D tensor (quantized in FP8) and that `s` is a 2D tensor with shape:

        (ceil(M/block_size), ceil(N/block_size))

    where M, N are the dimensions of `x`.

    Args:
        x (torch.Tensor): The quantized weight tensor with shape (M, N).
        s (torch.Tensor): The scaling factor tensor with shape (ceil(M/block_size), ceil(N/block_size)).
        block_size (int, optional): The block size used during quantization. Defaults to 128.

    Returns:
        torch.Tensor: The dequantized weight tensor with shape (M, N) and dtype given by torch.get_default_dtype().
    """
    # Ensure inputs are contiguous and 2D.
    assert x.is_contiguous() and s.is_contiguous(), "x and s must be contiguous"
    assert x.dim() == 2 and s.dim() == 2, "x and s must be 2D tensors"

    M, N = x.shape
    grid_rows = (M + block_size - 1) // block_size
    grid_cols = (N + block_size - 1) // block_size

    # Verify that s has the expected shape.
    if s.shape != (grid_rows, grid_cols):
        raise ValueError(f"Expected scale tensor s to have shape ({grid_rows}, {grid_cols}), but got {s.shape}")

    # Prepare an output tensor.
    # NOTE: torch.set_default_dtype(torch.bfloat16) in main, so torch.get_default_dtype() should be BF16.
    y = torch.empty((M, N), dtype=torch.get_default_dtype(), device=x.device)

    # Process each block independently.
    for i in range(grid_rows):
        row_start = i * block_size
        row_end = min((i + 1) * block_size, M)
        for j in range(grid_cols):
            col_start = j * block_size
            col_end = min((j + 1) * block_size, N)

            # Convert the block to float32 (like the Triton kernel's .to(tl.float32))
            block = x[row_start:row_end, col_start:col_end].to(torch.float32)
            scale = s[i, j]  # This is the scaling factor for the current block.

            # Multiply then cast the result to the default dtype—for example, bfloat16.
            y[row_start:row_end, col_start:col_end] = (block * scale).to(torch.get_default_dtype())

    return y


def weight_dequant_cpu_vectorized(x: torch.Tensor, s: torch.Tensor, block_size: int = 128) -> torch.Tensor:
    """
    Vectorized version of dequantizing weights using provided scaling factors.

    This function aims to replace the loops in weight_dequant_cpu with vectorized operations.

    Args:
        x (torch.Tensor): The quantized weight tensor with shape (M, N).
        s (torch.Tensor): The scaling factor tensor with shape (ceil(M/block_size), ceil(N/block_size)).
        block_size (int): The block size used during quantization.

    Returns:
        torch.Tensor: The dequantized weight tensor with shape (M, N) and dtype given by torch.get_default_dtype().
    """
    assert x.is_contiguous() and s.is_contiguous(), "x and s must be contiguous"
    assert x.dim() == 2 and s.dim() == 2, "x and s must be 2D tensors"

    M, N = x.shape
    device = x.device
    grid_rows = (M + block_size - 1) // block_size
    grid_cols = (N + block_size - 1) // block_size

    # Verify that s has the expected shape.
    if s.shape != (grid_rows, grid_cols):
        raise ValueError(f"Expected scale tensor s to have shape ({grid_rows}, {grid_cols}), but got {s.shape}")

    # Generate row and column indices
    row_indices = torch.arange(M, device=device)
    col_indices = torch.arange(N, device=device)

    # Compute block indices
    block_row_indices = row_indices // block_size  # shape (M,)
    block_col_indices = col_indices // block_size  # shape (N,)

    # Get scaling factors for each position
    s_expand = s[block_row_indices[:, None], block_col_indices[None, :]]  # shape (M, N)

    # Perform dequantization
    block = x.to(torch.float32)
    y = (block * s_expand).to(torch.get_default_dtype())

    return y


def main(fp8_path, bf16_path):
    """
    Converts FP8 weights to BF16 and saves the converted weights.

    This function reads FP8 weights from the specified directory, converts them to BF16,
    and saves the converted weights to another specified directory. It also updates the
    model index file to reflect the changes.

    Args:
        fp8_path (str): The path to the directory containing the FP8 weights and model index file.
        bf16_path (str): The path to the directory where the converted BF16 weights will be saved.

    Raises:
        KeyError: If a required scale_inv tensor is missing for a weight.

    Notes:
        - The function assumes that the FP8 weights are stored in safetensor files.
        - The function caches loaded safetensor files to optimize memory usage.
        - The function updates the model index file to remove references to scale_inv tensors.
    """
    torch.set_default_dtype(torch.bfloat16)
    os.makedirs(bf16_path, exist_ok=True)
    model_index_file = os.path.join(fp8_path, "model.safetensors.index.json")
    with open(model_index_file, "r") as f:
        model_index = json.load(f)
    weight_map = model_index["weight_map"]

    # Cache for loaded safetensor files
    loaded_files = {}
    fp8_weight_names = []

    # Helper function to get tensor from the correct file
    def get_tensor(tensor_name):
        """
        Retrieves a tensor from the cached safetensor files or loads it from disk if not cached.

        Args:
            tensor_name (str): The name of the tensor to retrieve.

        Returns:
            torch.Tensor: The retrieved tensor.

        Raises:
            KeyError: If the tensor does not exist in the safetensor file.
        """
        file_name = weight_map[tensor_name]
        if file_name not in loaded_files:
            file_path = os.path.join(fp8_path, file_name)
            loaded_files[file_name] = load_file(file_path, device="cpu")
        return loaded_files[file_name][tensor_name]

    safetensor_files = list(glob(os.path.join(fp8_path, "*.safetensors")))
    safetensor_files.sort()
    for safetensor_file in tqdm(safetensor_files):
        file_name = os.path.basename(safetensor_file)
        current_state_dict = load_file(safetensor_file, device="cpu")
        loaded_files[file_name] = current_state_dict

        new_state_dict = {}
        for weight_name, weight in current_state_dict.items():
            if weight_name.endswith("_scale_inv"):
                continue
            elif weight.element_size() == 1:  # FP8 weight
                scale_inv_name = f"{weight_name}_scale_inv"
                try:
                    # Get scale_inv from the correct file
                    scale_inv = get_tensor(scale_inv_name)
                    fp8_weight_names.append(weight_name)
                    new_state_dict[weight_name] = weight_dequant_cpu_vectorized(weight, scale_inv)
                except KeyError:
                    print(f"Warning: Missing scale_inv tensor for {weight_name}, skipping conversion")
                    new_state_dict[weight_name] = weight
            else:
                new_state_dict[weight_name] = weight

        new_safetensor_file = os.path.join(bf16_path, file_name)
        save_file(new_state_dict, new_safetensor_file)

        # Memory management: keep only the 2 most recently used files
        if len(loaded_files) > 2:
            oldest_file = next(iter(loaded_files))
            del loaded_files[oldest_file]
            torch.cuda.empty_cache()

    # Update model index
    new_model_index_file = os.path.join(bf16_path, "model.safetensors.index.json")
    for weight_name in fp8_weight_names:
        scale_inv_name = f"{weight_name}_scale_inv"
        if scale_inv_name in weight_map:
            weight_map.pop(scale_inv_name)
    with open(new_model_index_file, "w") as f:
        json.dump({"metadata": {}, "weight_map": weight_map}, f, indent=2)


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--input-fp8-hf-path", type=str, required=True)
    parser.add_argument("--output-bf16-hf-path", type=str, required=True)
    args = parser.parse_args()
    main(args.input_fp8_hf_path, args.output_bf16_hf_path)
It does the same thing but doesn't use the Triton kernel (change the two "cpu" to "cuda" if you want but I don't think it matters much if you use the vectorized version).
@ChuckMcSneed This should hopefully even work on CPU-only systems if you install Torch.
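If anyone wants to sanity-check it before committing hours to the real weights, here's a quick throwaway test (hypothetical, assumes the two dequant functions above are in scope) showing the loop and vectorized paths produce identical results:

# Throwaway check that weight_dequant_cpu and weight_dequant_cpu_vectorized agree.
# Uses a plain float tensor as a stand-in for the FP8 weights - the maths is the same.
import torch
from math import ceil

torch.set_default_dtype(torch.bfloat16)   # same as main() in the script above

M, N, block = 300, 500, 128               # deliberately not multiples of block_size
x = torch.randn(M, N, dtype=torch.float32).contiguous()
s = torch.rand(ceil(M / block), ceil(N / block), dtype=torch.float32).contiguous()

y_loop = weight_dequant_cpu(x, s, block)
y_vec = weight_dequant_cpu_vectorized(x, s, block)

assert y_loop.dtype == torch.bfloat16 and y_vec.dtype == torch.bfloat16
assert torch.equal(y_loop, y_vec), "loop and vectorized paths should match exactly"
print("OK:", y_loop.shape, y_loop.dtype)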
@gghfez
Have you figured out a way to keep R1 from descending into madness in a larger context of multi-turn creative interaction? I keep finding that I love how things start and then at some point.. usually around 16-20k of back and forth it just gets deeper and deeper down the rabbit hole. The creativity goes from entertaining to a mental disorder...
I got 9 tokens per second using all 6 GPUs and RPC:
prompt eval time = 1695.23 ms / 128 tokens ( 13.24 ms per token, 75.51 tokens per second)
eval time = 170082.72 ms / 1558 tokens ( 109.17 ms per token, 9.16 tokens per second)
total time = 171777.94 ms / 1686 tokens
but it seems that 3 bits for the non-shared experts' down_proj matrices ruins the model and makes it significantly dumber (tried with both Q3_K and IQ3_S now and both had the same effect).
@gghfez
Have you figured out a way to keep R1 from descending into madness in a larger context of multi-turn creative interaction? I keep finding that I love how things start and then at some point.. usually around 16-20k of back and forth it just gets deeper and deeper down the rabbit hole. The creativity goes from entertaining to a mental disorder...
Are you making sure to remove the old generated text between the thinking tags for each turn? I think that can cause the model to go "insane" from what people said in the OpenRouter discord.
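For anyone rolling their own frontend, a minimal sketch of what stripping the old CoT could look like (assumes R1's <think>...</think> format; the helper itself is just illustrative):

# Minimal sketch: drop the <think>...</think> reasoning from earlier assistant
# turns before sending the history back to the model.
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_cot(messages):
    cleaned = []
    for msg in messages:
        if msg["role"] == "assistant":
            msg = {**msg, "content": THINK_RE.sub("", msg["content"])}
        cleaned.append(msg)
    return cleaned

history = [
    {"role": "user", "content": "Continue the story."},
    {"role": "assistant", "content": "<think>Plan the next scene...</think>The rain kept falling."},
    {"role": "user", "content": "Now from Varis's point of view."},
]
print(strip_cot(history)[1]["content"])   # -> "The rain kept falling."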
@gghfez
Have you figured out a way to keep R1 from descending into madness in a larger context of multi-turn creative interaction? I keep finding that I love how things start and then at some point.. usually around 16-20k of back and forth it just gets deeper and deeper down the rabbit hole. The creativity goes from entertaining to a mental disorder...

Are you making sure to remove the old generated text between the thinking tags for each turn? I think that can cause the model to go "insane" from what people said in the OpenRouter discord.
I have been trimming out all but the last few. I like it 'knowing' the process I want it to use for 'think' but maybe that's a mistake. It worked well with the old think, reflect, output form with other models but maybe with this one I need to kill it every turn.
Ahhh that's a shame :-(
"I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead."
Oh god!
I'll have to keep an eye on this thread.
I did enjoy Ppoyaa/MythoNemo-L3.1-70B-v1.0
But my tastes are probably not as refined as others on this thread ;-)
Not sure I would call mine refined lol. I like what I like and I think I have found a group of people with similar tastes. I don't know 90% of what is said here, but I like being the most naive one in the room. Keeps expectations in check...
Not sure I would call mine refined lol. I like what I like and I think I have found a group of people with similar tastes. I don't know 90% of what is said here, but I like being the most naive one in the room. Keeps expectations in check...
That's fair. I mainly like it to be really obvious that each character speaks differently in their dialogue. It seems to be hard to find models that can do that well. I use novel crafter and have character sheets with example dialogue. I suspect it's as much a me issue as a model issue though.
I think most people are looking for something different than I am.
I'm not a software developer myself so I don't understand most of what's said here either. But it's interesting though isn't it? ;-)
I've actually got NUMA working properly in llama.cpp after the umpteenth attempt!!!
4.73 tokens per second using a 413GB Q4_K / Q6_K / Q8_0 mix on a machine with ~78GB/s maximum bandwidth per NUMA node.

This is nearly 2x what I was getting with a ~250GB Q2_K / Q4_K / Q8_0 mix earlier, and only half of what I got when I linked all 6 GPUs using RPC for an IQ2_S / IQ3_S / Q6_K model earlier!
It's quite involved so I'll post the exact details tomorrow as I'm pretty tired tonight, but it should probably work just as well for @ChuckMcSneed using pure-CPU inference too.
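As a back-of-envelope sanity check on that 4.73 tokens/second (assuming R1's published 671B total / 37B active parameter counts, and that both NUMA nodes' ~78GB/s can be used in parallel), generation looks roughly memory-bandwidth-bound:

# Rough, bandwidth-bound estimate only - ignores KV cache reads, prompt processing,
# NUMA cross-traffic, etc. 671B/37B are DeepSeek-R1's published total/active
# parameter counts; the other numbers come from this post.
total_params = 671e9
active_params = 37e9                  # parameters touched per token (MoE routing)
model_bytes = 413e9                   # the ~413GB Q4_K / Q6_K / Q8_0 mix above
bandwidth = 2 * 78e9                  # two NUMA nodes at ~78 GB/s each (best case)

bytes_per_token = active_params * (model_bytes / total_params)
print(f"~{bytes_per_token / 1e9:.1f} GB read per token")            # ~22.8 GB
print(f"ceiling ~{bandwidth / bytes_per_token:.2f} tokens/second")  # ~6.8, so 4.73 measured is in the right ballpark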
R1 is nothing if not original LOL
The dusty parking lot of Big Al's Used Cars stretches under a molten orange sunset, neon signs buzzing to life as cicadas thrum in the scorched Oklahoma air. Joe Dirt leans on his push broom near a row of mid-90s pickup trucks, their hoods shimmering with heatwaves. A half-eaten gas station burrito balances precariously on a stack of "AS IS" sale flyers. Across the lot, a dented El Camino with a Confederate flag bumper sticker coughs to a stop, its driver’s-side door squealing open to reveal… BigD.
spits sunflower seeds into a Folgers can "Well butter my butt and call me a biscuit—ain’t you a sight for sore tax returns!" He tugs at his mullet nervously, boot tapping to Skynyrd leaking from a busted radio. "Y’know, they say destiny’s like a septic tank—you don’t wanna go diggin’ ‘less you’re ready for what’s brewin’. What’s your story, stranger?" He gestures to a handwritten sign behind him: FREE LIFE ADVICE (RESULTS MAY VARY).*
The tang of burnt transmission fluid mixes with Joe’s AXE body spray. A tumbleweed of fast-food wrappers drifts past BigD’s boots as Slick Vic, the lot’s chain-smoking sales manager, squints from his trailer office. Crickets pause mid-chirp.
Joe leans closer, eyeing the El Camino’s trunk. "That there’s a ‘87 model? My second cousin Cleetus once smuggled a raccoon family in one o’ them. ‘Course, the raccoons unionized—*long story." He pulls a half-melted Jolly Rancher from his pocket, offering it like a peace treaty. Slick Vic’s shadow looms in the distance, flicking a cigarette butt toward BigD’s tires.
Yeah, R1 has pretty much blown everything else out of the water for me. I've run that same prompt I posted above about 100x tonight refining the NUMA stuff and it's pretty amazing how varied the final stories have been (with some min-p you can push the temperature right up to 3+ too!).
Yeah, R1 has pretty much blown everything else out of the water for me. I've run that same prompt I posted above about 100x tonight refining the NUMA stuff and it's pretty amazing how varied the final stories have been (with some min-p you can push the temperature right up to 3+ too!).
How high did you have to take min-p to keep it coherent at a 3 temp??
Only around 0.1 or even 0.05 IIRC.
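For anyone wondering why min-p lets you push the temperature that high: it drops every token whose probability is below min_p times the top token's probability, so the junk tail that high temperatures would otherwise inflate gets pruned first. A rough sketch of the idea (not llama.cpp's actual sampler, and real implementations differ on where the temperature step sits):

import torch

def sample_min_p(logits: torch.Tensor, temperature: float = 3.0, min_p: float = 0.05) -> int:
    """Illustrative min-p sampling: cut tokens below min_p * max_prob, then apply
    temperature to the survivors and sample from what's left."""
    probs = torch.softmax(logits, dim=-1)
    keep = probs >= min_p * probs.max()
    filtered = torch.where(keep, logits, torch.full_like(logits, float("-inf")))
    probs = torch.softmax(filtered / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.randn(32000)   # stand-in for a model's output logits
print(sample_min_p(logits, temperature=3.0, min_p=0.05))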
What quant are you running?
That was 1.73-bit (which I usually use)
I sometimes run the DeepSeek-R1-UD-IQ2_XXS, but it has to offload to SSD so I get slower prompt ingestion:
prompt eval time = 5936.61 ms / 29 tokens ( 204.71 ms per token, 4.88 tokens per second)
eval time = 242477.40 ms / 1005 tokens ( 241.27 ms per token, 4.14 tokens per second)
total time = 248414.02 ms / 1034 tokens
NUMA
That's a huge improvement, faster than a cloud server I rented recently.
Won't help my local setup as I only have one NUMA node. I'm hoping they make progress with flash-attention.
Have you figured out a way to keep R1 from descending into madness in a larger context of multi-turn creative interaction? I keep finding that I love how things start and then at some point.. usually around 16-20k of back and forth it just gets deeper and deeper down the rabbit hole. The creativity goes from entertaining to a mental disorder...
I can't run it past 12k with my vram+ram so haven't had that problem :D But deepseek recommend not sending the CoT traces for prior messages along with it.
I can't run it past 12k with my vram+ram so haven't had that problem :D But deepseek recommend not sending the CoT traces for prior messages along with it.
Ahh yes 😂 I am a bit spoiled right now. I'm going to miss it when it's gone.
I 'think' that was part of the issue. I started removing the CoT immediately and it made it further before it exited the highway for crazy town.
The second part might be I needed to purge my prompt. I have a laundry list of instructions on how I want things written that works well with Largestral and Llama33 based models that might be hurting more than helping with R1. I'll know soon enough.
Okay yeah, muuuuuch better with a threadbare prompt. Borderline obsessive with every line of instruction in there. I'm not sure what exactly was in there that turned every chat into the multiverse collapsing into itself, but there you have it.
@ChuckMcSneed Try this and see if it improves your NUMA performance:
- Turn off NUMA balancing in Linux using echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null (this only has to be run once per OS boot).
- Clear the page cache using echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null.
- Run sudo numactl -H to check the pages have been cleared, eg:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
node 0 size: 257860 MB
node 0 free: 257070 MB
node 1 cpus: 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
node 1 size: 257989 MB
node 1 free: 257062 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

- Run llama.cpp using numactl --interleave=all, set the --numa distribute command-line option, and set --threads to 1/2 the hardware threads you have in your system, eg:

> numactl --interleave=all ./llama-server --host 192.168.1.111 --port 8080 \
    --model ./DeepSeek-R1-Q5_K_XL.gguf --chat-template deepseek3 --alias "DeepSeek-R1-Q5_K_XL" \
    --ctx_size 8192 --threads 44

- Wait until you see: "main: server is listening on http://192.168.1.111:8080 - starting the main loop", then run a prompt.
- Finally, wait for all the MoE tensors to properly warm up (you can see the memory use of the process growing by watching top, etc) - for me this takes about 30 minutes!
- Re-run sudo numactl -H to check the pages have been equally distributed:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
node 0 size: 257860 MB
node 0 free: 19029 MB
node 1 cpus: 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
node 1 size: 257989 MB
node 1 free: 19222 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Now the model should be paged properly and you shouldn't need to do this again until the next OS boot, or if you want to change the model, etc.

You can probably gain a little bit more by reducing --threads now (you don't need to rerun all the above for this, but you possibly want to always use 1/2 the OS threads for the initial "warm-up" process [not tested yet]).
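If you get tired of eyeballing the numactl -H output while the warm-up runs, something like this (a hypothetical little helper, just parsing the same output as above) will report how evenly the pages have spread across the nodes:

# Parse `numactl -H` and report per-node memory usage, so you can see whether the
# interleaved warm-up has distributed the model pages roughly evenly.
import re
import subprocess

out = subprocess.run(["numactl", "-H"], capture_output=True, text=True, check=True).stdout
sizes = {int(n): int(mb) for n, mb in re.findall(r"node (\d+) size: (\d+) MB", out)}
free = {int(n): int(mb) for n, mb in re.findall(r"node (\d+) free: (\d+) MB", out)}

for node in sorted(sizes):
    used = sizes[node] - free[node]
    print(f"node {node}: {used} MB used of {sizes[node]} MB ({100 * used / sizes[node]:.1f}%)")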
For reference I'm using:

- Dual E5-2696v4, with 512GB of 2400MHz LR-DIMMS (all sockets populated) and a theoretical max per-socket bandwidth of ~78GB/s.
- BIOS set to "Home Snoop with Directory" NUMA mode (see: https://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/).

I'm also using the sl/custom-tensor-offload branch to offload only the massive MoE tensors using --override-tensor exps=CPU, but I think the same should work for a pure-CPU NUMA setup too.
My new 463GB custom Q5_K_XL quant, using this hacked into llama_tensor_get_type():
// ### JUK ###
if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q6_K) {
    if (name.find("_exps") != std::string::npos) {
        if (name.find("ffn_down") != std::string::npos) {
            new_type = GGML_TYPE_Q6_K;
        }
        else {
            if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
                new_type = GGML_TYPE_Q4_K;
            }
            else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) {
                new_type = GGML_TYPE_Q5_K;
            }
            else {
                new_type = GGML_TYPE_Q6_K;
            }
        }
    }
    else {
        new_type = GGML_TYPE_Q8_0;
    }
}
else
// ### JUK ###
I get this for the first run:
prompt eval time = 2167538.40 ms / 128 tokens (16933.89 ms per token, 0.06 tokens per second)
eval time = 461194.72 ms / 1973 tokens ( 233.75 ms per token, 4.28 tokens per second)
total time = 2628733.12 ms / 2101 tokens
and when using this optimised set of parameters for the second run:
numactl --interleave=all ./llama.cpp/build/bin/llama-server --host 192.168.1.111 --port 8080 \
--model ./DeepSeek-R1-Q5_K_XL.gguf --chat-template deepseek3 --alias "DeepSeek-R1-Q5_K_XL" --ctx_size 14336 --tensor-split 30,32 \
--n-gpu-layers 99 --override-tensor exps=CPU --numa distribute --threads 30 \
--temp 0.6 --min-p 0.0 --top-p 1.0 --top-k 0
I get:
prompt eval time = 91949.76 ms / 128 tokens ( 718.36 ms per token, 1.39 tokens per second)
eval time = 441279.72 ms / 1934 tokens ( 228.17 ms per token, 4.38 tokens per second)
total time = 533229.48 ms / 2062 tokens
Prompt processing is still pretty slow and I can't fit any more context than 14k for my 96GB of VRAM (!?), so gonna try this today:
https://github.com/ggerganov/llama.cpp/pull/11446
but it will require several hours to re-create the new GGUF tensors, etc :/
(I can get around 19k context using --cache-type-k q8_0 but the generation speed drops by about 20%)
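The q8_0 K cache makes sense size-wise: q8_0 stores roughly 8.5 bits per element versus 16 for f16, so quantising the K half of the cache buys a fair amount of context. Very rough arithmetic (assuming the K and V halves are about the same size, which isn't exact for DeepSeek's attention, but it shows the ballpark):

# Rough illustration only: q8_0 packs 32 values plus one fp16 scale per block,
# i.e. ~8.5 bits per element, versus 16 bits per element for an f16 cache.
F16_BITS = 16.0
Q8_0_BITS = (32 * 8 + 16) / 32        # = 8.5

k_ratio = Q8_0_BITS / F16_BITS        # K cache shrinks to ~53%
total_ratio = (k_ratio + 1.0) / 2.0   # V cache still f16 -> ~77% overall
print(f"total KV cache ~{100 * total_ratio:.0f}% of the f16 size")
print(f"14336 tokens of f16 budget -> ~{14336 / total_ratio:.0f} tokens")   # ~18.7k, close to the ~19k above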
I'm quanting R1-Zero tonight and it's supposed to be completely batshit crazy - wish me luck :D